Abstract: Typical dimensionality reduction (DR) methods are
often data-oriented, focusing on directly reducing the number
of random variables (features) while retaining the maximal
variations in the high-dimensional data. In unsupervised
situations, one of the main limitations of these methods lies in
their dependency on the scale of data features. This paper aims to
address the problem from a new perspective and considers
model-oriented dimensionality reduction in parameter spaces of
binary multivariate distributions.

Specifically, we propose a general parameter reduction criterion,
called the Confident-Information-First (CIF) principle, to
maximally preserve confident parameters and rule out less confident
parameters. Formally, the confidence of each parameter can be
assessed by its contribution to the expected Fisher information
distance within the geometric manifold over the neighbourhood
of the underlying real distribution.

We then revisit Boltzmann machines (BM) from a model
selection perspective and theoretically show that both the fully
visible BM (VBM) and the BM with hidden units can be derived
from the general binary multivariate distribution using the CIF
principle. This helps us uncover and formalize the essential
parts of the target density that BM aims to capture and the
non-essential parts that BM should discard. Guided by the theoretical
analysis, we develop a sample-specific CIF for model selection of
BM that is adaptive to the observed samples. The method is
studied in a series of density estimation experiments and is
shown to be effective in terms of estimation accuracy.

Index Terms: Information Geometry, Boltzmann Machine,
Parametric Reduction, Fisher Information

I. Introduction 

Recently, deep learning models (e.g., Deep Belief
Networks (DBN) [1], Stacked Denoising Auto-encoders [2],
and Deep Boltzmann Machines (DBM) [3]) have drawn
increasing attention due to their impressive empirical
performance in various application areas, such as computer vision,
natural language processing and information
retrieval. Despite these practical successes, there
have been debates on the fundamental principles behind the design
and training of those deep architectures. In most situations,
searching the parameter space of deep learning models is
difficult. To tackle this difficulty, unsupervised pre-training has
been introduced as an important process. In [10], it has been
empirically shown that unsupervised pre-training can fit
the network parameters in a region of the parameter space
that captures the data distribution well, thus alleviating the
generalization error of the trained deep architectures.

The process of pre-training aims to discover the latent
representation of the input data based on the learnt generative
model, from which we could regenerate the input data. A better
generative model would generally lead to more meaningful
latent representations. From the density estimation point of
view, pre-training can be interpreted as an attempt to recover
a set of parameters for a generative model that describes the
underlying distribution of the observed data. Since Boltzmann
machines (BM) are building blocks for many deep architectures
(e.g., DBN and DBM), we will focus on a formal analysis
of the essential parts of the target density that the BM can
capture in model selection.

In practice, the datasets that we deal with are often
high-dimensional. Thus we require a model with a high-dimensional
parameter space in order to effectively depict the
underlying real distribution. Overfitting usually occurs when the
model is excessively complex with respect to a small dataset.
On the other hand, if a large dataset is available, underfitting
occurs when the model is too simple to capture
the underlying trend of the data. Moreover, this connection
becomes more complicated if the observed samples contain
noise. Thus, to alleviate overfitting or underfitting, a basic
model selection criterion is needed to adjust the complexity
of the model with respect to the available observations (usually
insufficient or perturbed by noise). Next, for density estimation,
we will restate the model selection problem as parametric
reduction on the parameter space of multivariate distributions,
which leads to our general parameter reduction criterion,
i.e., the Confident-Information-First (CIF) principle.

Assuming there exists a universal parametric probabilistic
model S (with n free parameters) that is general enough to
represent all system phenomena, the goal of the parametric
reduction is to derive a lower-dimensional sub-model M (with
k ≪ n free parameters) by reducing the number of free
parameters in S. Note that the number of free parameters is
adopted as a model complexity measure, which is in line with
various model selection criteria, such as the Akaike information
criterion (AIC) [11] and the Bayesian information criterion
(BIC) [12].

In this paper, we formalize the parametric reduction in
the theoretical framework of information geometry (IG). In
IG, the general model S can be seen as an n-dimensional
manifold and M is a smooth submanifold of S. The number
of free parameters in M is restricted to be a constant k
(k ≪ n). Then, the major difficulty in the parametric
reduction procedure is the choice of parameters to keep or
to cut. In this paper, we propose to reduce parameters such
that the original geometric structure of S is preserved as
much as possible after projecting onto the submanifold M.

Fig. 1. Illustration of parametric reduction: Let S be a two-dimensional
manifold with two free parameters $\theta_1$ and $\theta_2$, and let $M_1$
(with free parameter $\theta_1$) and $M_2$ (with free parameter $\theta_2$)
be submanifolds of S. As an illustration in Euclidean space, we show $B_s$
(on which the true distribution $p_t$ is located) as the surface of a
hyper-ellipsoid centered at the sample distribution $p_s$ and determined
by the Fisher-Rao metric. Only part of the original distance between $p_t$
and $p_s$ ($p_t, p_s \in S$) can be preserved after projection onto a
submanifold M. The preferred M is the one that maximally preserves the
original distance after projection. Note that the scale of the distances
in Fig. 1 is shown for demonstration only and is not exactly proportional
to the real Riemannian distances induced by the Fisher-Rao metric.

Let $p_t, p_s \in S$ be the true distribution and the sampling
distribution (possibly perturbed from $p_t$ by sampling bias or
noise), respectively. It can be assumed that the true distribution
$p_t$ is located somewhere on an $\epsilon$-sphere surface $B_s$ centered at
$p_s$, i.e., $B_s = \{p_t \in S \mid D(p_t, p_s) = \epsilon\}$, where
$D(\cdot,\cdot)$ denotes the distance measure on the manifold S, and
$\epsilon$ is a small number. This assumption is made without loss of
generality, since $\epsilon$ is a small variable. For a distribution p, the
best approximation of p on M is the point q that belongs to M
and is the closest to p in terms of the distance measure, i.e.,
$q = \arg\min_{q' \in M} D(q', p)$, which is defined as the projection
of p onto M (denoted as $\Gamma_M(p)$).

Then, the parametric reduction can be defined as the
optimization problem of maximally preserving the expectation of
the Fisher information distance, subject to a constraint on
the number of parameters, when projecting distributions from
the parameter space of S onto that of the reduced submanifold
M:

$$\begin{aligned}
\text{maximize}\quad & \int_{p_t \in B_s} D(\Gamma_M(p_t), \Gamma_M(p_s))\, dB_s \\
\text{subject to}\quad & M \text{ has } k \text{ free parameters}
\end{aligned} \tag{1}$$

Here, the Fisher information distance (FID), i.e., the
Riemannian distance induced by the Fisher-Rao metric, is
adopted as the distance measure between two distributions,
since it is shown to be the unique metric meeting a set
of natural axioms for the distribution metric [13], [14], [15],
e.g., invariance with respect to reparametrizations
and monotonicity with respect to random maps on
variables. Let $\xi$ be the distribution parameters. For two close
distributions $p_1$ and $p_2$ with parameters $\xi_1$ and $\xi_2$, the Fisher
information distance between $p_1$ and $p_2$ is:

$$D(p_1, p_2) = \sqrt{(\xi_1 - \xi_2)^T G_\xi (\xi_1 - \xi_2)} \tag{2}$$

where $G_\xi$ is the Fisher information matrix [16].
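
As a small illustration of the local distance in Equation (2), the following Python sketch evaluates the quadratic form for two nearby parameter vectors. The function name and the Bernoulli example are assumptions made here for illustration; the only outside fact used is that a Bernoulli distribution with mean $\mu$ has Fisher information $1/(\mu(1-\mu))$.

```python
import math

def fisher_information_distance(xi1, xi2, G):
    """Local approximation D = sqrt(d^T G d) for two nearby parameter vectors.

    G is the Fisher information matrix (list of lists) evaluated near xi1, xi2.
    """
    d = [a - b for a, b in zip(xi1, xi2)]
    quad = sum(d[i] * G[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(quad)

# Illustrative one-parameter case: Bernoulli with mean mu (hypothetical values).
mu = 0.3
G = [[1.0 / (mu * (1.0 - mu))]]
dist = fisher_information_distance([0.30], [0.31], G)
print(dist)
```

The same parameter shift of 0.01 would give a larger distance near $\mu = 0.01$ than near $\mu = 0.5$, which is the sense in which the metric does not depend on the raw scale of the parameters.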

Note that the solution to this optimization problem
(Equation (1)) is not unique, since we can assign different fixed
values to the non-free parameters in M. Intuitively, to determine the
appropriate values for non-free parameters, our best choice is
the M that passes through $p_s$. However, in general cases where
$p_s$ is not specified in advance, it is natural to assign
non-free parameters a neutral value (e.g., zero). This treatment is
used by the general CIF (see Section III). If $p_s$ is specified in
advance, we can, in principle, further select an M as close to $p_s$
as possible. It turns out that we can develop a sample-specific
CIF w.r.t. given samples (presented in a later section).

The rationale for maximally preserving the Fisher information
distance can also be interpreted from the maximum-likelihood
(ML) estimation point of view. Let $\hat{\xi}$ be the ML
estimator of $\xi$. The asymptotic normality of ML estimation
implies that the distribution of $\hat{\xi}$ is the normal distribution with
mean $\xi$ and covariance $\Sigma$, i.e.,

$$f(\hat{\xi}) \sim N(\xi, \Sigma) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left\{-\frac{1}{2}(\hat{\xi} - \xi)^T \Sigma^{-1} (\hat{\xi} - \xi)\right\} \tag{3}$$

where the inverse of $\Sigma$ can be asymptotically estimated
using the Fisher information matrix $G_\xi$, as suggested by the
Cramér-Rao bound [17] and the asymptotic normality of ML
estimation. From the Fisher information distance given in
Equation (2), the exponent in Equation (3) is just the negative
of half the squared Fisher information distance between two
distributions $\hat{p}$ and $p$ determined by the close parameters $\hat{\xi}$
and $\xi$, respectively. Hence a larger Fisher information distance
means a lower likelihood. It turns out that, in density estimation,
maximally preserving the expected Fisher information distance
after the projection $\Gamma_M$ (Equation (1)) is equivalent to
maximally preserving the likelihood structure among close
distributions. In supervised learning (e.g., classification), maximally
preserving FID can also effectively preserve the likelihood
structure among different class densities (the underlying
distributions of classes), which is beneficial against sample noise.
Recall that sample noise always reduces the FID among class
densities in a statistical sense, which leads to a reduced
discrimination margin between two class densities. Hence,
for noisy data, the model that maximally preserves FID can
capture the dominant discrimination between class densities.

To solve the optimization problem in Equation (1), we
propose a parameter reduction criterion called the
Confident-Information-First (CIF) principle, described as follows. The
Fisher information distance $D(p_t, p_s)$ can be decomposed
into the distances of two orthogonal parts [14]. Moreover,
it is possible to divide the system parameters in S into two
categories (corresponding to the two decomposed distances),
i.e., the parameters with “major” variations and the parameters
with “minor” variations, according to their contributions to
the whole information distance. The former are parameters
that are important for reliably distinguishing the true
distribution from the sampling distribution, and are thus considered
“confident”. On the other hand, the parameters with minor
contributions can be considered less reliable. Hence, the CIF
principle can be stated as parametric reduction that preserves
the confident parameters and rules out less confident parameters.
We will theoretically show that CIF leads to an optimal
submanifold M in terms of the optimization problem defined
in Equation (1). It is worth emphasizing that the proposed CIF, as
a principle of parametric reduction, is fundamentally different
from the traditional feature reduction (or feature extraction)
methods [18], [19]. The latter focus on directly reducing
the dimensionality of the feature space by retaining maximal
variations in the data, e.g., Principal Component Analysis
(PCA) [20], while CIF offers a principled method to deal with
high-dimensional data in the parameter spaces by a strategy
that is derived from the first principle¹, independent of the
scales of features.

The main contributions of this paper are: 

1) We incorporate the Fisher information distance into the 
modelling of the intrinsic variations in the data that give 
rise to the desired model in the framework of IG. 

2) We propose a CIF principle for parametric reduction to
maximally preserve the confident parameters and rule
out less confident ones.

3) For binary multivariate distributions, we theoretically
show that CIF analytically leads to an optimal
submanifold w.r.t. the parametric reduction problem in
Equation (1).

4) The utility of CIF, i.e., the derivation of probabilistic
models, is illustrated by revisiting the Boltzmann
machines (BM). We show by examples that some existing
probabilistic models, e.g., the fully visible BM (VBM)
and the BM with hidden units, comply with the CIF
principle and can be derived from it.

5) Given certain samples, we propose a sample-specific
CIF-based model selection scheme for the Boltzmann
machines. It leads to a significant improvement in a
series of density estimation experiments.

II. Theoretical Foundations of IG

In this section, we introduce and develop the theoretical
foundations of IG for the manifold S of binary multivariate
distributions with a given number of variables n, i.e., the
open simplex of all probability distributions over the binary vector
$x \in \{0,1\}^n$. This will lay the foundation for our theoretical
derivation of the CIF.

A. Notations for Manifold S 

In IG, a family of probability distributions is considered as a
differentiable manifold with certain parametric coordinate
systems. In the case of binary multivariate distributions, four basic
coordinate systems are often used [13], [21]: p-coordinates,
$\eta$-coordinates, $\theta$-coordinates and the mixed $\zeta$-coordinates. The
$\zeta$-coordinates are of vital importance for our analysis.

For the p-coordinates [p], the probability distribution over the
$2^n$ states of x can be completely specified by any $2^n - 1$
positive numbers indicating the probabilities of the corresponding
exclusive states on n binary variables. For example, the
p-coordinates of n = 2 variables could be $[p] = (p_{01}, p_{10}, p_{11})$.
Note that IG requires all probability terms to be positive.
For simplicity, we use the capital letters I, J, ... to index the
coordinate parameters of a probability distribution. An index
I can be regarded as a subset of $\{1, 2, \ldots, n\}$. Additionally,
$p_I$ stands for the probability that all variables indicated by I
equal one and the complementary variables are zero. For
example, if $I = \{1, 2, 4\}$ and $n = 4$, we have:

$$p_I = p_{1101} = \mathrm{Prob}(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1)$$

Note that the null set can also be a legal index of the
p-coordinates, which indicates the probability that all variables
are zero, denoted as $p_{0 \ldots 0}$.
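
The index convention can be made concrete with a short Python sketch. The helper names `state_of` and `label` are hypothetical, introduced here only for illustration; variables are numbered from 1 as in the text.

```python
def state_of(index_set, n):
    """Binary state in which exactly the variables listed in index_set equal one."""
    return tuple(1 if i in index_set else 0 for i in range(1, n + 1))

def label(index_set, n):
    """Name of the p-coordinate indexed by index_set, e.g. p_1101."""
    return "p_" + "".join(str(bit) for bit in state_of(index_set, n))

print(label({1, 2, 4}, 4))  # p_1101, the example from the text
print(label(set(), 3))      # p_000, the null-set index
```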

The $\eta$-coordinates $[\eta]$ are defined by:

$$\eta_I = E[X_I] = \mathrm{Prob}\Big\{\prod_{i \in I} x_i = 1\Big\} \tag{4}$$

where the value of $X_I$ is given by $\prod_{i \in I} x_i$ and the expectation
is taken with respect to the probability distribution over x.
Grouping the coordinates by their orders, the $\eta$-coordinates are
denoted as $[\eta] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^n_{1,2,\ldots,n})$, where the superscript
indicates the order number of the corresponding parameter.
For example, $\eta^2_{ij}$ denotes the set of all $\eta$ parameters with
order number two.
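
A minimal sketch of Equation (4), assuming the distribution is stored as a dictionary from binary states to probabilities (a representation chosen here for illustration, not taken from the paper). Each $\eta_I$ is the total probability of the states in which every variable in I equals one.

```python
from itertools import combinations

def eta_coordinates(p, n):
    """eta_I = Prob(all x_i with i in I equal one), for every non-empty index set I."""
    eta = {}
    for order in range(1, n + 1):
        for I in combinations(range(1, n + 1), order):
            eta[I] = sum(prob for x, prob in p.items()
                         if all(x[i - 1] == 1 for i in I))
    return eta

# Hypothetical two-variable distribution, keyed by the state (x1, x2).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
eta = eta_coordinates(p, 2)
for I, value in eta.items():
    print(I, value)
```

For this example, the first-order coordinates are the marginals of $x_1$ and $x_2$, and the single second-order coordinate is the probability that both variables equal one.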

The $\theta$-coordinates (natural coordinates) $[\theta]$ are defined by:

$$\log p(x) = \sum_{I \subseteq \{1,2,\ldots,n\},\ I \neq \text{NullSet}} \theta^I X_I - \psi(\theta) \tag{5}$$

where $\psi(\theta) = \log\big(\sum_x \exp\{\sum_I \theta^I X_I(x)\}\big)$ is the cumulant
generating function and its value equals
$-\log \mathrm{Prob}\{x_i = 0, \forall i \in \{1, 2, \ldots, n\}\}$.
By solving the linear system (5), we have
$\theta^I = \sum_{K \subseteq I} (-1)^{|I - K|} \log(p_K)$. The $\theta$-coordinates are denoted
as $[\theta] = (\theta^i_1, \theta^{ij}_2, \ldots, \theta^{1,\ldots,n}_n)$, where the subscript indicates
the order number of the corresponding parameter. Note that
the order indices are located at different positions in $[\eta]$ and $[\theta]$,
following the convention in [13].
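
To make the log-linear expansion in Equation (5) concrete, the sketch below recovers the $\theta$-coordinates from the p-coordinates by Möbius inversion, $\theta^I = \sum_{K \subseteq I} (-1)^{|I-K|} \log p_K$, and then checks that they reproduce $\log p(x)$. The dictionary representation and helper names are assumptions made for illustration.

```python
import math
from itertools import combinations

def subsets(I):
    """All subsets of the tuple I, including the empty tuple."""
    return [K for r in range(len(I) + 1) for K in combinations(I, r)]

def state_of(K, n):
    """Binary state in which exactly the variables in K equal one."""
    return tuple(1 if i in K else 0 for i in range(1, n + 1))

def theta_coordinates(p, n):
    """theta^I = sum over K subset of I of (-1)^(|I|-|K|) * log p_K."""
    theta = {}
    for I in subsets(tuple(range(1, n + 1))):
        if I:  # the null set is not a theta index
            theta[I] = sum((-1) ** (len(I) - len(K)) * math.log(p[state_of(K, n)])
                           for K in subsets(I))
    return theta

# Hypothetical two-variable distribution, keyed by the state (x1, x2).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
theta = theta_coordinates(p, 2)
psi = -math.log(p[(0, 0)])  # psi(theta) = -log Prob(all variables are zero)

def log_p_from_theta(x):
    """Right-hand side of Equation (5): sum of active theta^I minus psi."""
    active = sum(t for I, t in theta.items() if all(x[i - 1] == 1 for i in I))
    return active - psi

for x in p:
    print(x, math.log(p[x]), log_p_from_theta(x))
```

The two printed columns agree for every state, which is the sense in which $[\theta]$ and $[p]$ describe the same distribution.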

The relation between the coordinate systems $[\eta]$ and $[\theta]$ is
bijective. More formally, they are connected by the Legendre
transformation:

$$\theta^I = \frac{\partial \phi(\eta)}{\partial \eta_I}, \qquad \eta_I = \frac{\partial \psi(\theta)}{\partial \theta^I} \tag{6}$$

where $\psi(\theta)$ is given in Equation (5) and
$\phi(\eta) = \sum_x p(x; \eta) \log p(x; \eta)$ is the negative of the entropy. It can be
shown that $\psi(\theta)$ and $\phi(\eta)$ meet the following identity:

$$\psi(\theta) + \phi(\eta) - \sum_I \theta^I \eta_I = 0 \tag{7}$$

The l-mixed $\zeta$-coordinates are defined by:

$$[\zeta]_l = [\eta^{l-}, \theta_{l+}] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^l_{i,j,\ldots,k}, \theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n) \tag{8}$$

where the first part consists of $\eta$-coordinates with order less than or
equal to l and the second part consists of $\theta$-coordinates with
order greater than l, $l \in \{1, \ldots, n-1\}$.

¹ The Fisher-Rao metric is considered as the first principle to measure the
distance between distributions since it is the unique metric meeting a set of
natural axioms for the distribution metric, as stated earlier.

B. Fisher Information Matrix for Parametric Coordinates

For a general coordinate system $[\xi]$, the i-th row and
j-th column element of the Fisher information matrix for $[\xi]$
(denoted by $G_\xi$) is defined as the covariance of the scores of
$[\xi_i]$ and $[\xi_j]$:

$$g_{ij} = E\left[\frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j}\right]$$

under the regularity condition that the partial derivatives
exist. The Fisher information measures the amount of
information in the data that a statistic carries about the
unknown parameters [22]. The Fisher information matrix is
of vital importance to our analysis, because the inverse of the
Fisher information matrix gives an asymptotically tight lower
bound to the covariance matrix of any unbiased estimate for
the considered parameters. Another important concept
related to our analysis is the orthogonality defined by Fisher
information. Two coordinate parameters $\xi_i$ and $\xi_j$ are called
orthogonal if and only if their Fisher information vanishes, i.e.,
$g_{ij} = 0$, meaning that their influences on the log-likelihood
function are uncorrelated.

The Fisher information for $[\theta]$ can be rewritten as
$g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}$, and for $[\eta]$, it is
$g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J}$. Let $G_\theta = (g_{IJ})$
and $G_\eta = (g^{IJ})$ be the Fisher information matrices for $[\theta]$ and
$[\eta]$, respectively. It can be shown that $G_\theta$ and $G_\eta$ are mutually
inverse matrices, i.e., $\sum_J g^{IJ} g_{JK} = \delta^I_K$, where $\delta^I_K = 1$ if
$I = K$ and zero otherwise. In order to generally
compute $G_\theta$ and $G_\eta$, we develop the following Propositions 2.1
and 2.2. Note that Proposition 2.1 is a generalization of
Theorem 2 in [?].

Proposition 2.1: The Fisher information between two
parameters $\theta^I$ and $\theta^J$ in $[\theta]$ is given by:

$$g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J \tag{9}$$

Proof in Appendix

Proposition 2.2: The Fisher information between two
parameters $\eta_I$ and $\eta_J$ in $[\eta]$ is given by:

$$g^{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I - K| + |J - K|} \cdot \frac{1}{p_K} \tag{10}$$

where $|\cdot|$ denotes the cardinality operator.

Proof in Appendix

We take the probability distribution with three variables
as an example. Based on Equation (10), the Fisher information
between $\eta_I$ and $\eta_J$ can be calculated, e.g.,
$g^{IJ} = \frac{1}{p_{000}} + \frac{1}{p_{010}}$ if $I = \{1, 2\}$ and $J = \{2, 3\}$, and
$g^{IJ} = -\left(\frac{1}{p_{000}} + \frac{1}{p_{100}} + \frac{1}{p_{010}} + \frac{1}{p_{110}}\right)$
if $I = \{1, 2\}$ and $J = \{1, 2, 3\}$, etc.

Based on $G_\eta$ and $G_\theta$, we can calculate the Fisher
information matrix $G_\zeta$ for the $[\zeta]_l$.

Proposition 2.3: The Fisher information matrix of $[\zeta]_l$
is given by:

$$G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \tag{11}$$

where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and $G_\theta$ are
the Fisher information matrices of $[\eta]$ and $[\theta]$, respectively, $I_\eta$
is the index set of the parameters shared by $[\eta]$ and $[\zeta]_l$, i.e.,
$\{\eta^1_i, \ldots, \eta^l_{i,j,\ldots,k}\}$, and $J_\theta$ is the index set of the parameters
shared by $[\theta]$ and $[\zeta]_l$, i.e., $\{\theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n\}$.

Proof in Appendix

III. The General CIF Principle


The general manifold S of all probability distributions over the
binary vector $x \in \{0,1\}^n$ can be exactly represented using
the $2^n - 1$ parametric coordinates. Given a target distribution
$q(x) \in S$, we consider the problem of realizing it by a
lower-dimensional submanifold M. This is defined as the problem
of parametric reduction for multivariate binary distributions.

In this section, we will formally illuminate the general CIF
for parametric reduction. Intuitively, if we can construct a
coordinate system so that the confidences of its parameters
entail a natural hierarchy, in which highly confident parameters
are significantly distinguished from and orthogonal to less
confident ones, then we can conveniently implement CIF
by keeping the highly confident parameters unchanged and
setting the less confident parameters to neutral values. As
described in Section I, the confidence of parameters should
be assessed according to their contributions to the expected
information distance. Therefore, the choice of coordinates in
CIF is crucial to its usage. This strategy is infeasible in
terms of p-coordinates, $\eta$-coordinates or $\theta$-coordinates, since
the orthogonality condition cannot hold in these coordinate
systems. In this section, we will show that the
l-mixed-coordinates meet the requirement of CIF.

To give an intuitive picture of the general CIF strategy and
its significance w.r.t. the mixed-coordinates $[\zeta]_l$, we will first show
that the l-mixed-coordinates $[\zeta]_l$ meet the requirement of CIF
in typical distributions that generate real-world datasets. Then
we will prove that CIF leads to an optimal submanifold
w.r.t. the parametric reduction problem in Equation (1) in
general cases.

A. The CIF in Typical Distributions 

To facilitate our analysis, we make a basic assumption
on the underlying distribution q(x): at least
$(2^n - 2^{n/2})$ p-coordinates are of the scale $\epsilon$, where $\epsilon$
is a sufficiently small value. Thus, the residual p-coordinates
(at most $2^{n/2}$) are all significantly larger than zero (of
scale $\Theta(1/2^{n/2})$), and their sum approximates one. Note
that these assumptions commonly hold in real-world
data collections [23], since the frequent (or meaningful)
patterns are only a small fraction of all of the
system states.

Next, we introduce a small perturbation $\Delta p$ to the
p-coordinates [p] of the true distribution q(x). The perturbed
distribution is denoted as q'(x). For p-coordinates that are
significantly larger than zero, the scale of each fluctuation
$\Delta p_I$ is assumed to be proportional to the standard deviation
of the corresponding p-coordinate $p_I$ by some small coefficient
(upper bounded by a constant $\alpha$), which can be approximated
by the inverse of the square root of its Fisher information via
the Cramér-Rao bound. It turns out that we can assume the
perturbation $\Delta p_I$ to be $\alpha\sqrt{p_I}$. For p-coordinates with a small
value (approximately zero), the scale of each fluctuation $\Delta p_I$
is assumed to be proportional to $\alpha p_I$.

In this section, we adopt the l-mixed-coordinates
$[\zeta]_l = (\eta^{l-}; \theta_{l+})$, where l = 2 is used in the following analysis.
Let $\Delta\zeta_q = (\Delta\eta^{2-}; \Delta\theta_{2+})$ be the increment of the
mixed-coordinates after the perturbation. The squared Fisher
information distance $D^2(q, q') = (\Delta\zeta_q)^T G_\zeta \Delta\zeta_q$ can be
decomposed into the direction of each coordinate in $[\zeta]_l$. We
will clarify that, under typical cases, the scale of the Fisher
information distance in each coordinate of $\theta_{l+}$ (which will be reduced
by CIF) is asymptotically negligible, compared to that in each
coordinate of $\eta^{l-}$ (which will be preserved by CIF).


The scale of the squared Fisher information distance in the
direction of $\eta_I$ is proportional to $\Delta\eta_I \cdot (G_\zeta)_{I,I} \cdot \Delta\eta_I$, where
$(G_\zeta)_{I,I}$ is the Fisher information of $\eta_I$ in terms of the
mixed-coordinates $[\zeta]_2$. From Equation (4), for any I of order one
(or two), $\eta_I$ is the sum of $2^{n-1}$ (or $2^{n-2}$) p-coordinates,
and its scale is $\Theta(1)$. Hence, the increment $\Delta\eta^{2-}$ is
proportional to $\Theta(1)$, denoted as $\alpha \cdot \Theta(1)$. It is difficult to give
an explicit expression of $(G_\zeta)_{I,I}$ analytically. However, the
Fisher information $(G_\zeta)_{I,I}$ of $\eta_I$ is bounded by the (I, I)-th
element of the inverse covariance matrix [24], which is exactly
$1/g_{I,I}(\theta)$ (see Proposition 2.3). Hence, the scale of $(G_\zeta)_{I,I}$
is also $\Theta(1)$. It turns out that the scale of the squared
Fisher information distance in the direction of $\eta_I$ is $\alpha^2 \cdot \Theta(1)$.


Similarly, for the part $\theta_{2+}$, the scale of the squared Fisher
information distance in the direction of $\theta^J$ is proportional to
$\Delta\theta^J \cdot (G_\zeta)_{J,J} \cdot \Delta\theta^J$, where $(G_\zeta)_{J,J}$ is the Fisher information
of $\theta^J$ in terms of the mixed-coordinates $[\zeta]_2$. The scale of $\theta^J$
is maximally $f(k)|\log(\sqrt{\epsilon})|$ based on Equation (5), where
k is the order of $\theta^J$ and f(k) is the number of p-coordinates
of scale $\Theta(1/2^{n/2})$ that are involved in the calculation of
$\theta^J$. Since we assume that $f(k) \leq 2^{n/2}$, the maximum scale
of $\theta^J$ is $2^{n/2}|\log(\sqrt{\epsilon})|$. Thus, the increment $\Delta\theta^J$ is of a
scale bounded by $\alpha \cdot 2^{n/2}|\log(\sqrt{\epsilon})|$. Similar to our previous
derivation, the Fisher information $(G_\zeta)_{J,J}$ of $\theta^J$ is bounded by
the (J, J)-th element of the inverse covariance matrix, which
is exactly $1/g^{J,J}(\eta)$ (see Proposition 2.3). Hence, the scale of
$(G_\zeta)_{J,J}$ is $(2^k - f(k))^{-1}\epsilon$. In summary, the scale of the squared
Fisher information distance in the direction of $\theta^J$ is bounded
by the scale of $\alpha^2 \cdot \Theta\left(\frac{2^n \epsilon\, |\log(\sqrt{\epsilon})|^2}{2^k - f(k)}\right)$. Since $\epsilon$ is a sufficiently
small value and $\alpha$ is constant, the scale of the squared Fisher
information distance in the direction of $\theta^J$ is asymptotically
zero.


According to the above analysis, the confidences of the coordinate
parameters (measured by the decomposed Fisher information
distance) in $[\zeta]_l$ entail a natural hierarchy: the confidences of the
first part, the highly confident parameters $[\eta^{l-}]$, are significantly
larger than those of the second part, the less confident parameters
$[\theta_{l+}]$. Additionally, those less confident parameters $[\theta_{l+}]$ have
the neutral value of zero. Moreover, the parameters in $[\eta^{l-}]$ are
orthogonal to the ones in $[\theta_{l+}]$, indicating that we could estimate
these two parts independently. Hence, we can implement the CIF
for parametric reduction in $[\zeta]_l$ by replacing the less confident
parameters with the neutral value zero and reconstructing the
resulting distribution. It turns out that the submanifold of S
tailored by CIF becomes $[\zeta]_{l_t} = (\eta^1_i, \ldots, \eta^l_{ij \ldots k}, 0, \ldots, 0)$. We
call $[\zeta]_{l_t}$ the l-tailored-mixed-coordinates.

TABLE I
Simulation on the FID preserved by $[\zeta]_{l_t}$ ($l = 2$)

 n    Ratio of preserved    Ratio of preserved FID
      parameters            Mean      Standard deviation
 3    0.857                 0.9972    0.0055
 4    0.667                 0.9963    0.0043
 5    0.484                 0.9923    0.0054
 6    0.333                 0.9824    0.0112
 7    0.220                 0.9715    0.0111

Fig. 2. By projecting a point q(x) on S to a submanifold M, the
l-tailored mixed-coordinates $[\zeta]_{l_t}$ give a desirable M that maximally
preserves the expected Fisher information distance when projecting an
$\epsilon$-neighborhood centered at q(x) onto M.

To verify our theoretical analysis, we conduct a simulation
on the ratio of FID that is preserved by the l-tailored-mixed-coordinates
(l = 2) $[\zeta]_{l_t}$ w.r.t. the original mixed-coordinates
$[\zeta]_l$. First, we randomly select a real distribution $p_t$ with n
variables, where the distribution satisfies the basic assumption that
we make at the beginning of this section (the $2^{n/2}$ significant
p-coordinates are generated based on the Jeffreys prior, and the
remaining p-coordinates are set to a small constant). Then we generate the
sample distribution $p_s$ based on random samples drawn from
the real distribution. Last, we calculate the FID between $p_t$ and
$p_s$ in terms of $[\zeta]_l$ and $[\zeta]_{l_t}$, respectively. The result is shown
in Table I. We can see that $[\zeta]_{l_t}$ can indeed preserve most
of the FID, which is consistent with our theoretical analysis.
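
The "ratio of preserved parameters" column of Table I follows directly from counting coordinates: with l = 2, the tailored coordinates keep the $\binom{n}{1} + \binom{n}{2}$ first- and second-order $\eta$ parameters out of $2^n - 1$ in total. A short sketch of this count (the function name is hypothetical; `math.comb` needs Python 3.8 or later):

```python
from math import comb

def preserved_parameter_ratio(n, l=2):
    """Fraction of the 2^n - 1 coordinates kept by the l-tailored mixed coordinates."""
    kept = sum(comb(n, order) for order in range(1, l + 1))
    return kept / (2 ** n - 1)

for n in range(3, 8):
    print(n, round(preserved_parameter_ratio(n), 3))
```

The printed ratios reproduce the second column of Table I, so the share of kept parameters falls quickly with n while the preserved FID reported in the table stays above 0.97.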

B. The CIF Leads to an Optimal Submanifold M 

Let $B_q$ be an $\epsilon$-sphere surface centered at q(x) on the manifold
S, i.e., $B_q = \{q' \in S \mid KL(q, q') = \epsilon\}$, where $KL(\cdot,\cdot)$
denotes the KL divergence and $\epsilon$ is small. Additionally,
q'(x) is a neighbor of q(x) uniformly sampled on $B_q$, as
illustrated in Figure 2. Recall that, for a small $\epsilon$, the KL
divergence can be approximated by half of the squared Fisher
information distance. Thus, in the parameterization of $[\zeta]_l$, $B_q$
is indeed the surface of a hyper-ellipsoid (centered at q(x))
determined by $G_\zeta$. The following proposition shows that the
general CIF leads to an optimal submanifold M that
maximally preserves the expected information distance, where
the expectation is taken over the uniform neighborhood $B_q$.
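
For reference, the approximation invoked here is the standard second-order expansion of the KL divergence around q. The sketch below writes $\Delta\zeta$ for the coordinate difference between q and q' in the mixed coordinates (a notation assumed here for illustration):

```latex
KL(q, q') = \tfrac{1}{2}\,\Delta\zeta^{T} G_{\zeta}\,\Delta\zeta + O(\|\Delta\zeta\|^{3})
          \approx \tfrac{1}{2}\,D^{2}(q, q'),
\qquad\text{so}\qquad
B_q \approx \{\, q' \in S \mid \Delta\zeta^{T} G_{\zeta}\,\Delta\zeta = 2\epsilon \,\}.
```

The level set of a positive-definite quadratic form is an ellipsoid, which is why $B_q$ appears as a hyper-ellipsoid whose axes are set by $G_\zeta$.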

Proposition 3.1: Consider the manifold S in
l-mixed-coordinates $[\zeta]_l$. Let k be the number of free parameters
in the l-tailored-mixed-coordinates $[\zeta]_{l_t}$. Then, among all
k-dimensional submanifolds of S, the submanifold determined
by $[\zeta]_{l_t}$ can maximally preserve the expected information
distance induced by the Fisher-Rao metric.

Proof in Appendix

IV. CIF-based Interpretation of Boltzmann Machine

In the previous section, a general CIF was uncovered in the $[\zeta]_l$
coordinates for multivariate binary distributions. Now we
consider the implementations of CIF when l equals 2 using
the Boltzmann machines (BM).


A. Introduction to the Boltzmann Machines 

In general, a BM [25] is defined as a stochastic neural
network consisting of visible units $x \in \{0,1\}^{n_x}$ and hidden
units $h \in \{0,1\}^{n_h}$, where each unit fires stochastically
depending on the weighted sum of its inputs. The energy
function is defined as follows:

$$E_{BM}(x, h; \xi) = -\frac{1}{2}x^T U x - \frac{1}{2}h^T V h - x^T W h - b^T x - d^T h \tag{12}$$

where $\xi = \{U, V, W, b, d\}$ are the parameters: visible-visible
interactions (U), hidden-hidden interactions (V),
visible-hidden interactions (W), visible self-connections (b) and
hidden self-connections (d). The diagonals of U and V are set
to zero. We can express the Boltzmann distribution over the
joint space of x and h as below:

$$p(x, h; \xi) = \frac{1}{Z}\exp\{-E_{BM}(x, h; \xi)\} \tag{13}$$

where Z is a normalization factor.

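1) The Coordinates for Boltzmann Machines:

Let B be the set of Boltzmann distributions realized by
BM. Actually, B is a submanifold of the general manifold
$S_{xh}$ over $\{x, h\}$. From Equations (13) and (12), we can see
that $\xi = \{U, V, W, b, d\}$ plays the role of B's coordinates in
$\theta$-coordinates (Equation (5)) as follows:

$$\begin{aligned}
\theta_1:\quad & \theta_1^{x_i} = b_{x_i},\ \theta_1^{h_j} = d_{h_j} \quad (\forall x_i \in x,\ h_j \in h) \\
\theta_2:\quad & \theta_2^{x_i x_j} = U_{x_i, x_j},\ \theta_2^{x_i h_j} = W_{x_i, h_j},\ \theta_2^{h_i h_j} = V_{h_i, h_j} \quad (\forall x_i, x_j \in x;\ h_i, h_j \in h) \\
\theta_{2+}:\quad & \theta_m^{x_1 \ldots x_i h_1 \ldots h_j} = 0,\ m > 2 \quad (\forall x_1, \ldots, x_i \in x;\ h_1, \ldots, h_j \in h)
\end{aligned} \tag{14}$$

So the $\theta$-coordinates for BM are given by:

$$[\theta]_{BM} = (\underbrace{\theta_1^{x_i}, \theta_1^{h_j}}_{1\text{-order}},\ \underbrace{\theta_2^{x_i x_j}, \theta_2^{x_i h_j}, \theta_2^{h_i h_j}}_{2\text{-order}},\ \underbrace{0, \ldots, 0}_{\text{orders} > 2})$$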

The VBM and the restricted BM (RBM) are special cases of the general
BM. Since VBM has $n_h = 0$ and all the visible units are
connected to each other, the parameters of VBM are
$\xi_{vbm} = \{U, b\}$ and $\{V, W, d\}$ are all set to zero. The RBM has
connections only between hidden and visible units. Thus, the
parameters of RBM are $\xi_{rbm} = \{W, b, d\}$ and $\{U, V\}$ are set
to zero.
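
For very small networks, Equations (12) and (13) can be evaluated exactly by enumerating every joint state. The sketch below does so in plain Python; the function names and the numeric weights are hypothetical and chosen only for illustration, and the example is an RBM (U and V are zero).

```python
import math
from itertools import product

def energy(x, h, U, V, W, b, d):
    """E(x, h) = -1/2 x'Ux - 1/2 h'Vh - x'Wh - b'x - d'h, as in Equation (12)."""
    nx, nh = len(x), len(h)
    e = -0.5 * sum(x[i] * U[i][j] * x[j] for i in range(nx) for j in range(nx))
    e -= 0.5 * sum(h[i] * V[i][j] * h[j] for i in range(nh) for j in range(nh))
    e -= sum(x[i] * W[i][j] * h[j] for i in range(nx) for j in range(nh))
    e -= sum(b[i] * x[i] for i in range(nx))
    e -= sum(d[j] * h[j] for j in range(nh))
    return e

def boltzmann_distribution(nx, nh, U, V, W, b, d):
    """Joint distribution p(x, h) of Equation (13) by exhaustive enumeration."""
    states = [(x, h) for x in product((0, 1), repeat=nx)
              for h in product((0, 1), repeat=nh)]
    weights = [math.exp(-energy(x, h, U, V, W, b, d)) for x, h in states]
    Z = sum(weights)  # normalization factor
    return {s: w / Z for s, w in zip(states, weights)}

# Hypothetical RBM with two visible units and one hidden unit.
U = [[0.0, 0.0], [0.0, 0.0]]
V = [[0.0]]
W = [[1.0], [-0.5]]
b = [0.1, -0.2]
d = [0.3]
p = boltzmann_distribution(2, 1, U, V, W, b, d)
print(sum(p.values()))
```

Enumeration costs $2^{n_x + n_h}$ evaluations, which is why the sampling-based learning rules discussed next are used in practice.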

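2) The Gradient-based Learning of BM:

Given the sample x generated from the underlying
distribution, maximum-likelihood (ML) learning is a commonly
used gradient ascent method for training BM in order to
maximize the log-likelihood $\log p(x; \xi)$ of the parameters $\xi$
[26]. Based on Equation (13), the log-likelihood is given as
follows:

$$\log p(x; \xi) = \log \sum_h e^{-E(x, h; \xi)} - \log \sum_{x', h'} e^{-E(x', h'; \xi)}$$

Differentiating the log-likelihood, the gradient with respect to
$\xi$ is as follows:

$$\frac{\partial \log p(x; \xi)}{\partial \xi} = \sum_h p(h \mid x; \xi) \frac{\partial[-E(x, h; \xi)]}{\partial \xi} - \sum_{x', h'} p(x', h'; \xi) \frac{\partial[-E(x', h'; \xi)]}{\partial \xi}$$

where $\frac{\partial E(x, h; \xi)}{\partial \xi}$ can be easily calculated from Equation (12).

Then we can obtain the stochastic gradient using Gibbs
sampling [27] in two phases: sample h given x for the first
term, called the positive phase, and sample (x', h') from the
stationary distribution $p(x', h'; \xi)$ for the second term, called
the negative phase. Now with the resulting stochastic gradient
estimation, the learning rule is to adjust $\xi$ by:

$$\Delta\xi_{ML} = \varepsilon \cdot \frac{\partial \log p(x; \xi)}{\partial \xi} = \varepsilon \left( -\left\langle \frac{\partial E(x, h; \xi)}{\partial \xi} \right\rangle_0 + \left\langle \frac{\partial E(x, h; \xi)}{\partial \xi} \right\rangle_\infty \right) \tag{17}$$

where $\varepsilon$ is the learning rate, $\langle\cdot\rangle_0$ denotes the average using the
sample data and $\langle\cdot\rangle_\infty$ denotes the average with respect to the
stationary distribution $p(x, h; \xi)$ after the corresponding Gibbs
sampling phases.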

To avoid the difficulty of computing the log-likelihood gradient, Contrastive Divergence (CD) [26] performs gradient descent on a different objective function, shown as follows:

$$
\Delta\xi = -\varepsilon \cdot \frac{\partial \left( KL(p_0\|p) - KL(p_m\|p) \right)}{\partial \xi} \propto -\left\langle \frac{\partial E(x,h;\xi)}{\partial \xi} \right\rangle_0 + \left\langle \frac{\partial E(x,h;\xi)}{\partial \xi} \right\rangle_m \tag{18}
$$

where $p_0$ is the sample distribution, $p_m$ is the distribution obtained by starting the Markov chain at the data and running $m$ steps, and $KL(\cdot\|\cdot)$ denotes the KL divergence. CD can be seen as an approximation to ML that replaces the last expectation $\langle\cdot\rangle_\infty$ with $\langle\cdot\rangle_m$.
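As a rough illustration of the CD update, the following Python sketch estimates the CD-m gradient for a fully visible BM, where the positive phase reduces to an average over the data. The energy form $E(x) = -\sum_{i<j} U_{ij} x_i x_j - \sum_i b_i x_i$, the function names and the sequential Gibbs sweep are assumptions made for this example, not the implementation used in the experiments.

```python
import math
import random


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def gibbs_sweep(x, U, b, rng):
    """One sequential Gibbs sweep over all visible units of a VBM.

    Assumes U is symmetric with a zero diagonal, so that
    p(x_i = 1 | rest) = sigmoid(b_i + sum_j U_ij x_j).
    """
    x = list(x)
    n = len(x)
    for i in range(n):
        act = b[i] + sum(U[i][j] * x[j] for j in range(n) if j != i)
        x[i] = 1 if rng.random() < sigmoid(act) else 0
    return x


def cd_gradient(data, U, b, m=1, seed=0):
    """CD-m estimate <.>_0 - <.>_m of the gradient for U and b."""
    rng = random.Random(seed)
    n = len(b)
    grad_b = [0.0] * n
    grad_U = [[0.0] * n for _ in range(n)]
    for x0 in data:
        xm = list(x0)
        for _ in range(m):  # run the chain m steps, starting at the data
            xm = gibbs_sweep(xm, U, b, rng)
        for i in range(n):
            grad_b[i] += (x0[i] - xm[i]) / len(data)
            for j in range(i + 1, n):
                diff = (x0[i] * x0[j] - xm[i] * xm[j]) / len(data)
                grad_U[i][j] += diff
                grad_U[j][i] += diff
    return grad_U, grad_b
```

A parameter update would then add `epsilon * grad` to `U` and `b`; letting `m` grow recovers the negative phase of the ML rule in Equation (17).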
B. The Fully Visible Boltzmann Machine

Consider the parametric reduction on the manifold $S$ over $\{x\}$ that ends up with a $k$-dimensional submanifold $M$ of $S$, where $k \ll 2^{n_x} - 1$ is the number of free parameters in $M$. $M$ is set to have the same dimensionality as VBM, i.e., $k = \frac{n_x(n_x+1)}{2}$, so that all candidate submanifolds are comparable to the submanifold $M_{vbm}$ endowed by VBM. Next, the rationale underlying the design of $M_{vbm}$ can be illuminated using the general CIF.

1) The Derivation of VBM via CIF:

In the following corollary, we show that the statistical manifold $M_{vbm}$ is the optimal parameter subspace spanned by those directions with high confidences in terms of CIF.

Corollary 4.1: Given the general manifold $S$ in 2-mixed-coordinates $[\zeta]_2$, VBM (with coordinates $[\zeta]_{2t}$) defines a $k$-dimensional submanifold of $S$ that can maximally preserve the expected Fisher information distance induced by the Fisher-Rao metric.

Proof in Appendix 

2) The Interpretation of VBM Learning via CIF:

To learn such $[\zeta]_{2t}$, we need to learn the parameters $\xi$ of VBM such that its stationary distribution preserves the same coordinates $[\eta^{2-}]$ as the target distribution $q(x)$. This is exactly what traditional gradient-based learning algorithms intend to do. The next proposition shows that the ML learning of VBM is equivalent to learning the coordinates $[\zeta]_{2t}$.

Proposition 4.2: Given the target distribution $q(x)$ with 2-mixed coordinates:

$$
[\zeta]_2 = (\eta_i^1, \eta_{ij}^2, \theta_{2+})
$$

the coordinates of the stationary distribution of VBM trained by ML are uniquely given by:

$$
[\zeta]_{2t} = (\eta_i^1, \eta_{ij}^2, \theta_{2+} = 0)
$$

Proof in Appendix 
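Proposition 4.2 can be checked numerically on a toy problem. The sketch below fits a three-unit VBM to a target distribution by exact gradient ascent on the log-likelihood (all expectations are computed by enumeration, so no sampling is involved) and exposes the first- and second-order $\eta$-coordinates together with the third-order $\theta$-coordinate. The energy form $E(x) = -\sum_{i<j} U_{ij} x_i x_j - \sum_i b_i x_i$, the function names, the learning rate and the step count are assumptions chosen for illustration.

```python
import itertools
import math


def vbm_distribution(U, b):
    """Exact VBM distribution over {0,1}^n (upper triangle of U is used)."""
    n = len(b)
    states = list(itertools.product((0, 1), repeat=n))
    weights = []
    for x in states:
        s = sum(b[i] * x[i] for i in range(n))
        s += sum(U[i][j] * x[i] * x[j]
                 for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(s))
    z = sum(weights)
    return {x: w / z for x, w in zip(states, weights)}


def moments(dist, n):
    """First- and second-order eta-coordinates E[x_i] and E[x_i x_j]."""
    eta1 = [sum(p * x[i] for x, p in dist.items()) for i in range(n)]
    eta2 = {(i, j): sum(p * x[i] * x[j] for x, p in dist.items())
            for i in range(n) for j in range(i + 1, n)}
    return eta1, eta2


def top_order_theta(dist, n):
    """Highest-order theta-coordinate: alternating sum of log-probabilities."""
    return sum((-1) ** (n - sum(x)) * math.log(p) for x, p in dist.items())


def fit_vbm_ml(q, n, lr=0.5, steps=10000):
    """Exact ML gradient ascent: data moments minus model moments."""
    U = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    q1, q2 = moments(q, n)
    for _ in range(steps):
        p1, p2 = moments(vbm_distribution(U, b), n)
        for i in range(n):
            b[i] += lr * (q1[i] - p1[i])
        for (i, j), target in q2.items():
            U[i][j] += lr * (target - p2[(i, j)])
    return U, b
```

For a target with a non-zero third-order interaction, the fitted VBM should reproduce $\eta_i^1$ and $\eta_{ij}^2$ of the target while its own third-order $\theta$-coordinate stays at zero, which is the content of $[\zeta]_{2t}$.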

C. The Boltzmann Machine with Hidden Units 

In the previous section, the CIF is applied to models without hidden units and leads to VBM by preserving the 1-order and 2-order $\eta$-coordinates. In this section, we investigate the cases where hidden units are introduced.

Let $S_{xh}$ be the manifold of distributions over the joint space of visible units $x$ and hidden units $h$. A general BM produces a stationary distribution $p(x,h;\xi) \in S_{xh}$ over $\{x,h\}$. Let $B$ denote the submanifold of $S_{xh}$ with probability distributions $p(x,h;\xi)$ realizable by BM.

Given any target distribution $q(x)$, only the marginal distribution of BM over the visible units is specified, leaving the distribution on the hidden units free to vary. Let $H_q$ be the submanifold of $S_{xh}$ with probability distributions $q(x,h)$ that have the same marginal distribution as $q(x)$ and whose conditional distribution $q(h|x)$ of hidden units is realised by the BM's activation functions with some parameter $\xi_{bm}$.

Then, the best BM is the one that minimizes the distance between $B$ and $H_q$. Due to the existence of hidden units, the solution may not be unique. In this section, the training process of BM is analysed in terms of manifold projection (described in Section II), following the framework of the learning rule proposed in [?]. We will show that the invariance in the learning of BM is the CIF.

Fig. 3. The iterative learning for BM: in searching for the minimum distance between $H_q$ and $B$, we first choose an initial BM $p_0$ and then perform the projections $\Gamma_H(p)$ and $\Gamma_B(q)$ iteratively, until the fixed points of the projections $p^*$ and $q^*$ are reached. With different initializations, the iterative projection algorithm may end up with different local minima on $H_q$ and $B$, respectively.

1) The Iterative Projection Learning for BM: 

The learning algorithm using iterative manifold projection was first proposed in [?] and theoretically compared to the EM (Expectation-Maximization) algorithm in [28]. The learning of RBM can be implemented by the following iterative projection process: let $\xi_p^0$ be the initial parameters of BM and $p_0(x,h;\xi_p^0)$ be the corresponding stationary distribution.

For $i = 0, 1, 2, \ldots$:

1) Put $q_{i+1}(x,h) = \Gamma_H(p_i(x,h;\xi_p^i))$

2) Put $p_{i+1}(x,h;\xi_p^{i+1}) = \Gamma_B(q_{i+1}(x,h))$

where $\Gamma_H(p)$ denotes the projection of $p(x,h;\xi_p)$ to $H_q$, and $\Gamma_B(q)$ denotes the projection of $q(x,h)$ to $B$. The iteration ends when we reach the fixed points of the projections $p^*$ and $q^*$, that is, $\Gamma_H(p^*) = q^*$ and $\Gamma_B(q^*) = p^*$. The iterative projection process is illustrated in Figure 3. The convergence of this iterative algorithm is guaranteed by the following proposition:

Proposition 4.3: The following monotonic relation holds in the iterative learning algorithm:

$$
D[q_{i+1}, p_i] \ge D[q_{i+1}, p_{i+1}] \ge D[q_{i+2}, p_{i+1}] \tag{19}
$$

where the equality holds only for the fixed points of the projections.

Proof in Appendix 

The next two propositions show how the projections $\Gamma_H(p)$ and $\Gamma_B(q)$ are obtained.

Proposition 4.4: Given a distribution $p(x,h;\xi_p) \in B$, the projection $\Gamma_H(p) \in H_q$ that gives the minimum divergence $D(H_q, p(x,h;\xi_p))$ from $H_q$ to $p(x,h;\xi_p)$ is the $q(x,h;\xi_{bm}) \in H_q$ that satisfies $\xi_{bm} = \xi_p$.

Proof in Appendix H

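Proposition 4.5: Given $q(x,h;\xi_q) \in H_q$ with mixed coordinates

$$
[\zeta]_q = (\eta_{x_i}^1, \eta_{h_j}^1, \eta_{x_i x_j}^2, \eta_{x_i h_j}^2, \eta_{h_i h_j}^2, \theta_{2+}),
$$

the coordinates of the learnt projection $\Gamma_B(q) \in B$ are uniquely given by the tailored mixed coordinates:

$$
[\zeta]_{\Gamma_B(q)} = (\eta_{x_i}^1, \eta_{h_j}^1, \eta_{x_i x_j}^2, \eta_{x_i h_j}^2, \eta_{h_i h_j}^2, \theta_{2+} = 0) \tag{20}
$$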


Proof: This proof comes in three parts:

1) the projection $\Gamma_B(q)$ of $q(x,h)$ on $B$ is unique;

2) this unique projection $\Gamma_B(q)$ can be achieved by minimizing the divergence $D[q(x,h), B]$ using the gradient descent method;

3) the mixed coordinates of $\Gamma_B(q)$ are exactly those given in Equation (20).

See Appendix I for the detailed proof.

2) The Interpretation for BM Learning via CIF:

The iterative projection learning (IP) gives us an alternative way to investigate the learning process of BM. Based on the CIF principle in Section [?], we can see that the process of the projection $\Gamma_B(q_i)$ can be derived from CIF, i.e., the highly confident coordinates $[\eta_{x_i}^1, \eta_{h_j}^1, \eta_{x_i x_j}^2, \eta_{x_i h_j}^2, \eta_{h_i h_j}^2]$ of $q_i$ are preserved while the lowly confident coordinates $[\theta_{2+}]$ are set to the neutral value zero, as given in Equation (20).

In summary, the essential parts of the real distribution that can be learnt by BM (with and without hidden units) are exactly the confident coordinates indicated by the CIF principle.

V. Experimental Study on Sample-specific CIF

In this section, we empirically investigate the sample-specific CIF principle in density estimation tasks for Boltzmann machines. More specifically, we aim to adaptively determine the free parameters in BM, such that BM can be trained to be as close to the sample distribution as possible w.r.t. specific samples. For VBM, we investigate how to use CIF to modify the topology of VBM by removing less confident connections among visible units with respect to the given samples. For BM with hidden units, we extend the traditional restricted BM (RBM) by allowing connections among visible units, called vRBM. Then we apply CIF to vRBM to emphasise the learning on confident connections among visible units.

A. The Sample-specific CIF: Adaptive Model Selection of BM

Based on our theoretical analysis in Section III and Section IV-B, BM uses the most confident information (i.e., $[\eta^{2-}]$) for approximating distributions in an expected sense. However, for a distribution with specific samples, can CIF further recognize less-confident parameters and reduce them properly? Next, inspired by the general CIF, we introduce an adaptive network design for VBM based on the given samples, which could automatically balance the model complexity of BM and the sample size. The data constrains the state of knowledge about the unknown distribution. Let $q(x)$ denote the sampling distribution (representing the data). In order to force the estimate of our probabilistic model (denoted as $p(x;\xi)$) to meet the data, we could incorporate the data into CIF by recognizing the confidence of the parameters $\xi$ in terms of $q(x)$. Then, the parametric reduction procedure can be further applied to modify the topology of VBM adaptively according to the data, as shown in Algorithm 1 and explained in the following. Note that this algorithm can also be used in BM with hidden units, such as vRBM (see Section V-C).


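Algorithm 1 Adaptive Network Design for BM

Input: Samples $D = \{d_1, d_2, \ldots, d_N\}$; Significance level $\alpha$;
       Nodes $V = \{x_1, x_2, \ldots, x_n\}$; Edges $U = \{U_{ij}, \forall x_i, x_j\}$;
Output: Set of confident edges $U_{conf} \subset U$

$U_{conf} \leftarrow \{\}$; $\alpha \leftarrow 0.05$
for $U_{ij} \in U$ do
    Estimate marginal distribution $p(x_i, x_j)$ from samples
    ** parameterize to $\zeta$-coordinates: $[\zeta]$ **
    $\eta_i \leftarrow E_p[x_i]$, $\eta_j \leftarrow E_p[x_j]$
    $\theta^{ij} \leftarrow \log p_{00} - \log p_{01} - \log p_{10} + \log p_{11}$
    $[\zeta] \leftarrow (\eta_i, \eta_j, \theta^{ij})$
    ** Fisher information of $\theta^{ij}$ in $[\zeta]$ **
    $g \leftarrow \left( \frac{1}{p_{00}} + \frac{1}{p_{01}} + \frac{1}{p_{10}} + \frac{1}{p_{11}} \right)^{-1}$
    ** confidence of $\theta^{ij}$ in $[\zeta]$ **
    $\rho_{ij} \leftarrow \theta^{ij} \cdot g \cdot \theta^{ij}$
    ** hypothesis test: $\rho_{ij} = 0$ against $\rho_{ij} \neq 0$ **
    $\pi \leftarrow cdf_{\chi^2(1)}(N\rho_{ij})$
    if $(1 - \pi) \cdot 2 < \alpha$ then
        ** reject null hypothesis: $\rho_{ij} = 0$ **
        $U_{conf} \leftarrow U_{conf} \cup \{U_{ij}\}$
    end if
end for
return $U_{conf}$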


As a graphical model, the VBM comprises a set of vertices $V = \{x_1, x_2, \ldots, x_n\}$ together with a set of connections $U = \{U_{ij}, \forall x_i, x_j, i \neq j\}$. The confidence of each connection parameter $U_{ij}$ can be assessed by the parameter selection criterion in CIF, i.e., its contribution to the Fisher information distance. Based on Theorem 1 in [13], $U_{ij}$ can be expressed as follows:

$$
U_{ij} = \log \frac{p(x_i = x_j = 1 | A) \cdot p(x_i = x_j = 0 | A)}{p(x_i = 1, x_j = 0 | A) \cdot p(x_i = 0, x_j = 1 | A)}
$$

where the relation holds for any condition $A$ on the rest of the variables. However, it is often infeasible to calculate the exact value of $U_{ij}$ because of data sparseness. To tackle this problem, we propose to approximate the value of $U_{ij}$ using the marginal distribution $p(x_i, x_j)$, which avoids the effect of the condition $A$.

Let $[\zeta]_{ij} = (\eta_i, \eta_j, \theta^{ij})$ be the mixed coordinates for the marginal distribution $p(x_i, x_j)$ of VBM. Note that each $\theta^{ij}$ corresponds to one connection $U_{ij}$. Since $\theta^{ij}$ is orthogonal to $\eta_i$ and $\eta_j$, the Fisher information distance between two distributions can be decomposed into two independent parts: the information distance contributed by $\{\eta_i, \eta_j\}$ and that contributed by $\{\theta^{ij}\}$. For the purpose of parameter reduction, we consider two close distributions $p_1$ and $p_2$ with coordinates $\zeta_1 = \{\eta_i, \eta_j, \theta^{ij}\}$ and $\zeta_2 = \{\eta_i, \eta_j, 0\}$ respectively. The confidence of $\theta^{ij}$, denoted as $\rho(\theta^{ij})$, can be estimated by its contribution to the Fisher information distance between $p_1$ and $p_2$:

$$
\rho(\theta^{ij}) = (\zeta_1 - \zeta_2)^T G_\zeta (\zeta_1 - \zeta_2) = \theta^{ij} \cdot g_\zeta(\theta^{ij}) \cdot \theta^{ij} \tag{21}
$$

where $G_\zeta$ is the Fisher information matrix in Proposition 2.3 and $g_\zeta(\theta^{ij})$ is the Fisher information for $\theta^{ij}$. Note that the second equality holds since $\theta^{ij}$ is orthogonal to $\eta_i$ and $\eta_j$.

To decide whether the Fisher information distance in the coordinate direction of $\theta^{ij}$ is significant or negligible, we set up a hypothesis test for $\rho$, i.e., the null hypothesis $\rho = 0$ versus the alternative $\rho \neq 0$. Based on the analysis in [?], we have $N\rho \sim \chi^2(1)$ asymptotically, where $\chi^2(1)$ is the chi-square distribution with 1 degree of freedom and $N$ is the number of samples. For example, for a coordinate $\theta^{ij}$ with $N\rho = 5.024$, we have 95% confidence to reject the null hypothesis. That is, we regard the Fisher information distance in the direction of $\theta^{ij}$ as significant at the 95% confidence level.
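The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the additive `smoothing` of the pairwise counts is an assumption introduced here to keep every $p_{ab}$ strictly positive, and the $\chi^2(1)$ CDF is computed through the identity $cdf_{\chi^2(1)}(x) = \mathrm{erf}(\sqrt{x/2})$.

```python
import math
from itertools import combinations


def chi2_cdf_1dof(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0


def confident_edges(samples, alpha=0.05, smoothing=0.5):
    """Return the set of index pairs (i, j) whose connection passes the test.

    samples: list of equal-length sequences of 0/1 values.
    smoothing: pseudo-count added to each joint cell (assumption, avoids log 0).
    """
    n_samples = len(samples)
    n_vars = len(samples[0])
    selected = set()
    for i, j in combinations(range(n_vars), 2):
        # Estimate the marginal distribution p(x_i, x_j) from the samples.
        counts = {(a, c): smoothing for a in (0, 1) for c in (0, 1)}
        for row in samples:
            counts[(row[i], row[j])] += 1.0
        total = n_samples + 4 * smoothing
        p = {cell: cnt / total for cell, cnt in counts.items()}
        # theta^{ij}: log odds ratio of the 2x2 marginal table.
        theta = (math.log(p[(0, 0)]) - math.log(p[(0, 1)])
                 - math.log(p[(1, 0)]) + math.log(p[(1, 1)]))
        # Fisher information of theta^{ij} in the mixed coordinates.
        g = 1.0 / sum(1.0 / v for v in p.values())
        rho = theta * g * theta
        # Hypothesis test: rho = 0 against rho != 0.
        pi = chi2_cdf_1dof(n_samples * rho)
        if (1.0 - pi) * 2.0 < alpha:
            selected.add((i, j))
    return selected
```

With $N\rho = 5.024$ the CDF value is about 0.975, so $(1 - \pi) \cdot 2 \approx 0.05$, which sits exactly at the threshold used for the 95% example in the text.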


B. Experiments with VBM 


In this section, we investigate the density estimation performance of CIF-based model selection methods for VBM. Three methods are compared:

1) Rand-CV: we perform random selection of VBM's connections, and the best model is selected based on k-fold cross validation.

2) CIF-CV: connections are selected in descending order of their confidences (defined in Equation (21)), constrained on the number of free model parameters. Then, the best model is selected based on k-fold cross validation.

3) CIF-Htest: the topology of VBM is determined by the adaptive algorithm described in Algorithm 1.
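The confidence-ordered selection used by CIF-CV can be sketched as below. The helper name and the dictionary layout are hypothetical; the function keeps the `max_edges` most confident connections and reports the fraction of total confidence they retain, which corresponds to the model-complexity ratio used in the figures.

```python
def select_by_confidence(confidences, max_edges):
    """Keep the max_edges connections with the highest confidence.

    confidences: dict mapping a connection, e.g. (i, j), to its confidence rho.
    Returns the kept connections (most confident first) and the ratio
    (sum of kept confidences) / (sum of all confidences).
    """
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    kept = ranked[:max_edges]
    total = sum(confidences.values())
    ratio = sum(confidences[e] for e in kept) / total if total > 0 else 0.0
    return kept, ratio
```

Cross validation would then be run over several values of `max_edges`, and the candidate with the best held-out score kept.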


1) Experiments on artificial dataset:

The artificial binary dataset is generated as follows: we first randomly select the target distribution $q(x)$ from the open probability simplex over the $n$ random variables using the Jeffreys prior [?]. Then, a dataset with $N$ samples is generated from $q(x)$. For computational simplicity, the artificial dataset is set to be 10-dimensional. The CD learning algorithm is used to train the VBMs.

The Full-VBM, i.e., the VBM with full connections, is used as the baseline. $k$ is set to 5 for cross validation. KL divergence is used to evaluate the goodness-of-fit of the VBMs trained by the various algorithms. For each sample size $N$, we run 20 randomly generated distributions and report the averaged KL divergences. Note that we focus on the case where the number of variables is relatively small ($n = 10$) in order to evaluate the KL divergence analytically and give a detailed study of the algorithms. Changing the number of variables has only a minor influence on the experimental results, since we obtained qualitatively similar observations for various numbers of variables (not reported here).
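For a small number of binary variables, the KL divergence can be evaluated exactly by enumerating all states. The sketch below assumes that both distributions are stored as dictionaries over the same state set and that the divergence is taken as $KL(q\|p)$, with $q$ the reference distribution and $p$ the model; the direction is an assumption, as the text does not state it.

```python
import math


def kl_divergence(q, p):
    """Exact KL(q || p) for two distributions given as dicts over the same states."""
    total = 0.0
    for state, q_prob in q.items():
        if q_prob > 0.0:  # terms with q(state) = 0 contribute nothing
            total += q_prob * math.log(q_prob / p[state])
    return total
```

For $n = 10$ the dictionaries hold $2^{10} = 1024$ states, so the exact evaluation stays cheap.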

Results and Summary: The averaged KL divergences between VBM and the underlying real distribution are shown in Figure 4. We can see that all model selection methods improve the density estimation results of VBM, especially when the sample size is small (N = 100 to 1100). With relatively large samples, the effect of parameter reduction gradually becomes marginal.

Fig. 4. Density estimation results for VBM (plot title: "Model selection for VBM: density estimation task")

Comparing CIF-CV with Rand-CV, the performances on relatively small sample sizes (N = 100, 300, 500) are similar (see Figure 4). This is because the VBM selected by cross validation is the trivial one with no connections. In Figure 5 (first column), we illustrate how the KL divergence between VBM and the real/sample distribution changes with the model complexity of VBM. We can see that although CIF is worse than Rand in terms of the KL divergence between VBM and the real distribution in most setups of model complexity, CIF is better at estimating the sampling distribution. This is consistent with our previous theoretical insight that CIF preserves the most confident parameters in VBM, where the confidence is estimated based on the sampling distribution.

As the sample size increases (N = 700, 900, 1100), CIF-CV gradually outperforms Rand-CV (see Figure 4). This can be explained by the CIF principle. CIF preserves the most confident parameters of VBM with respect to the sample distribution. As the sample size increases, the sampling distribution grows closer to the real distribution, which benefits CIF in that the KL divergences with both the sample and the real distributions can be simultaneously better than those of Rand, as shown in Figure 5 (second and third columns).

With relatively large sample sizes (N = 1500, 3000), CIF-CV and Rand-CV have similar performance in terms of the KL divergence between VBM and the real distribution (see Figure 4). This is because a complex model is preferred when the samples are sufficient, and both CIF-CV and Rand-CV tend to select the trivial VBM with full connections. Therefore, the difference between CIF-CV and Rand-CV becomes marginal.
Fig. 5. Changes of KL divergence between VBM and sample/real distributions w.r.t. model complexity (surviving panel titles: sample size N = 100; sample size N = 3000). The model complexity is measured by the ratio = (sum of confidences for preserved connections) / (total confidences for all connections).


However, CIF is still more powerful in describing the sample/real distributions compared to Rand, as shown in Figure 5 (fourth column).

For the two CIF-based algorithms, CIF-Htest is worse than CIF-CV when the sample size is small and gradually achieves similar or better performance than CIF-CV as the sample size increases. The main advantage of CIF-Htest over CIF-CV is that there is no need for the time-consuming cross validation.

In summary, the CIF-based model selection (e.g., CIF-CV and CIF-Htest) indicates the balance between VBM's model complexity and the amount of information learnt from samples. For nontrivial cases of model selection (trivial cases are the selected VBMs with no connections or full connections), CIF-based methods can reduce the KL divergence between VBM and the real distribution without hurting the VBM too much in describing the sample distribution, as compared to Rand (Fig. 5). Moreover, CIF-Htest provides an automatic way to adaptively select a suitable VBM with respect to the given samples.

2) Experiments on real datasets: 

In this section, we empirically investigate how the CIF-based model selection algorithm works on real-world datasets in the context of density estimation. In particular, we use the VBM to learn the underlying probability density over 100 terms of the 20 News Groups binary dataset, with different model complexities (changing the ratio of preserved Fisher information distance). There are 18000 documents in 20 News Groups in total, which are partitioned into two sets: a train set (80%) and a test set (20%). The learning rate for CD was manually tuned for proper convergence and set to 0.01 for all models.

Fig. 6. Performance changes on real dataset w.r.t. model complexity (plot title: "Model selection for VBM: density estimation on 20 News groups")

Since it is infeasible to compute the KL divergence due to the high dimensionality, the averaged Hamming distance between the samples in the dataset and those generated from the VBM is used to evaluate the goodness-of-fit of the VBMs trained by the various algorithms. Let $D = \{d_1, d_2, \ldots, d_N\}$ denote the dataset of $N$ documents, where each document $d_i$ is a 100-dimensional binary vector. To evaluate a VBM with parameter $\xi_{vbm}$, we first randomly generate $N$ samples from the stationary distribution $p(x;\xi_{vbm})$, denoted as $V = \{v_1, v_2, \ldots, v_N\}$. Then the averaged Hamming distance $D_{ham}$ is calculated as follows:

$$
D_{ham}[D, V] = \frac{\sum_{d_i \in D} \min_{v_j \in V} Ham[d_i, v_j]}{N}
$$

where $Ham[d_i, v_j]$ is the number of positions at which the corresponding values are different.

Three kinds of VBMs are compared: the full VBM without parametric reduction; the VBMs of different model complexities using Rand; and the VBMs of different model complexities using CIF. After training all VBMs on the training dataset, we evaluate the trained VBMs on the test dataset. The result is shown in Figure 6. We also mark the VBM that is automatically selected by CIF-Htest. We can see that model selection (CIF and Rand) achieves significantly better performance than the full VBM for a wide range of the ratio $r$, which is consistent with our observations in the experiments on artificial datasets when the samples are insufficient. The best performance of CIF also exceeds that of Rand. The performance of CIF-Htest (ratio = 0.3), also shown in Figure 6, is close to the optimal solution (ratio = 0.2).


C. Experiments with vRBM 

The BM with hidden units is practically more interesting than VBM, since it has higher representation power. In particular, one of the fundamental problems in neural network research is unsupervised representation learning [31], which attempts to characterize the underlying distribution through the discovery of a set of latent variables (or features). The restricted BM (RBM) is one of the most widely used models for learning one level of feature extraction. Then, in deep learning models [1], the representation learnt at one level is used as input for learning the next level. In this section, we extend the RBM by allowing connections among visible units (called vRBM, shown in Figure 8) and further investigate the CIF-based model selection algorithm (see Section V-A) empirically. Note that in the model selection for vRBM, only the parameters in $U$ (connections within visible units) are affected and all parameters in $\{W, b, d\}$ (connections between hidden and visible units, visible/hidden self-connections) are preserved, so as to maintain the structure of vRBM.

1) Experimental Setup: 

For computational simplicity, the artificial dataset is 10-dimensional, and the number of hidden units in vRBM is set to 10. For the model selection of vRBM, three methods are compared: CIF-CV, Rand-CV and CIF-Htest, which follow the same model selection schemes as for VBM in Section V-B. The standard RBM is adopted as the baseline. Note that for model selection, we are actually adding connections among visible units to the standard RBM. The learning algorithm is CD. KL divergence is used to evaluate the goodness-of-fit of the vRBMs trained by the various algorithms.

2) Results and Summary: 

The averaged KL divergences between vRBM and the underlying distribution are shown in Figure 9. For a small sample size (N = 100, 300), there is little need to increase the model complexity of BM, and hence adding connections among visible units does not improve the density estimation results. As the sample size increases (from 500 to 3000), however, the model selection methods gradually and significantly outperform the standard RBM by adding connections among visible units.

Comparing CIF-CV with Rand-CV, the performances on relatively small sample sizes (N = 100, 300) are similar (see Figure 9). This is because the vRBM selected by cross validation is the trivial one with no connections added to RBM. In Figure 7 (first column), we illustrate how the KL divergence between vRBM and the real/sample distribution changes with the model complexity of vRBM. We can see that CIF is better at estimating the sample distribution compared with Rand (first column, first row). Similarly to VBM, the performance of CIF in estimating the real distribution becomes similar to or worse than that of Rand as the model complexity increases (ratio > 0.6), due to overfitting (first column, second row).

Fig. 8. The network structure of vRBM (layer labels: Hidden Layer, Visible Layer)

Fig. 9. Density estimation results of vRBM (plot title: "Model selection for vRBM: density estimation task")

As the sample size increases (N = 500 to 1500), CIF-CV gradually outperforms Rand-CV (see Figure 9). This can be explained by the CIF principle. CIF-CV preserves the most confident parameters of vRBM with respect to the sample distribution. As the sample size increases, the sampling distribution grows closer to the real distribution, which benefits CIF in that the KL divergences with both the sample and the real distributions can be simultaneously better than those of Rand for all model complexities, as shown in Figure 7 (second and third columns).

With larger sample sizes (N ≥ 3000), CIF-CV and Rand-CV have similar performance in terms of the KL divergence between vRBM and the real distribution. This is because a complex model is preferred when the samples are sufficient, and both CIF-CV and Rand-CV tend to select the trivial vRBM with all connections between visible units added to RBM. Therefore, the difference between CIF-CV and Rand-CV becomes marginal. However, CIF is still more powerful in describing the sample distribution compared to Rand for the vRBM with the same number of connections, as shown in Figure 7 (fourth column).

Fig. 7. Changes of KL divergence between vRBM and sample/real distributions w.r.t. model complexity (panel titles: sample size N = 100; N = 900; N = 1100; N = 3000; horizontal axis: ratio of the preserved Fisher information distance). The model complexity is measured by the ratio = (sum of confidences for preserved visible connections) / (total confidences for all visible connections).

For the two CIF-based algorithms, CIF-Htest is worse than CIF-CV when the sample size is small and gradually outperforms CIF-CV as the sample size increases.

In summary, the CIF-based model selection can balance vRBM's model complexity against the amount of information learnt from samples, and can simultaneously reduce the KL divergence between vRBM and the real/sample distributions. This indicates that the CIF is also useful for BM with hidden units.



Fig. 10. A multi-layer BM with visible units $x$ and two hidden layers. The greedy layer-wise training of a deep architecture is to maximally preserve the confident information layer by layer. Note that the prohibition sign indicates that the Fisher information on lowly confident coordinates is not preserved.

VI. Conclusions and Future works 

In this paper, we study the parametric reduction and model selection problem of Boltzmann machines from both theoretical and applicational perspectives. On the theoretical side, we propose the CIF principle for parametric reduction, which maximally preserves the confident parameters and rules out the less confident ones. For binary multivariate distributions, we theoretically show that CIF leads to an optimal submanifold in terms of Equation (?). Furthermore, we illustrate that Boltzmann machines (with or without hidden units) can be derived from the general manifold based on the CIF principle. In future work, the CIF could be the start of an information-oriented interpretation of deep learning models in which BM is used as a building block. For the deep Boltzmann machine (DBM) [3], several layers of RBM compose a deep architecture in order to achieve a representation at a sufficient abstraction level. The CIF principle describes how the information flows in those representation transformations, as illustrated in Figure 10. We propose that each layer of DBM determines a submanifold $M$ of $S$, where $M$ maximally preserves the highly confident information on parameters. Then the whole DBM can be seen as the process of repeatedly applying CIF in each layer, achieving a tradeoff between the abstractness of the representation features and the intrinsic information confidence preserved on parameters. A more detailed analysis of deep models is left for future work.

On the applicational side, we propose a sample-specific CIF-based model selection scheme for BM, i.e., CIF-Htest, that can automatically adapt to the given samples. It is studied in a series of density estimation experiments. In future work, we plan to incorporate CIF-Htest into deep learning models (such as DBM) to modify the network topology such that the most confident information in the data can be well captured.
Appendix A

Proof of Proposition 2.1

Proof: By definition, we have:

$$
g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}
$$

where $\psi(\theta)$ is defined by Equation (7). Hence, we have:

$$
g_{IJ} = \frac{\partial^2 \left( \sum_I \theta^I \eta_I - \phi(\eta) \right)}{\partial \theta^I \partial \theta^J} = \frac{\partial \eta_I}{\partial \theta^J}
$$

By differentiating $\eta_I$, defined by Equation (?), with respect to $\theta^J$, we have:

$$
g_{IJ} = \frac{\partial \eta_I}{\partial \theta^J} = \frac{\partial \sum_x X_I(x) \exp\{\sum_I \theta^I X_I(x) - \psi(\theta)\}}{\partial \theta^J} = \sum_x X_I(x)\,[X_J(x) - \eta_J]\,p(x;\theta) = \eta_{I \cup J} - \eta_I \eta_J
$$

This completes the proof.

Appendix B

Proof of Proposition 2.2

Proof: By definition, we have:

$$
g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J}
$$

where $\phi(\eta)$ is defined by Equation (?). Hence, we have:

$$
g^{IJ} = \frac{\partial^2 \left( \sum_J \theta^J \eta_J - \psi(\theta) \right)}{\partial \eta_I \partial \eta_J} = \frac{\partial \theta^I}{\partial \eta_J}
$$

Based on Equations (?) and (?), $\theta^I$ and $p_K$ can be calculated by solving a linear equation of $[p]$ and $[\eta]$, respectively. Hence, we have:

$$
\theta^I = \sum_{K \subseteq I} (-1)^{|I-K|} \log(p_K); \qquad p_K = \sum_{K \subseteq J} (-1)^{|J-K|} \eta_J
$$

Therefore, the partial derivative of $\theta^I$ with respect to $\eta_J$ is:

$$
g^{IJ} = \frac{\partial \theta^I}{\partial \eta_J} = \sum_K \frac{\partial \theta^I}{\partial p_K} \cdot \frac{\partial p_K}{\partial \eta_J} = \sum_{K \subseteq I \cap J} (-1)^{|I-K| + |J-K|} \cdot \frac{1}{p_K}
$$

This completes the proof.

Appendix C

Proof of Proposition 2.3

Proof: The Fisher information matrix of $[\zeta]$ can be partitioned into four parts: $G_\zeta = \begin{pmatrix} A & C \\ D & B \end{pmatrix}$. It can be verified that in the mixed coordinates, the $\theta$-coordinate of order $k$ is orthogonal to any $\eta$-coordinate of order less than $k$, implying that the corresponding element of the Fisher information matrix is zero ($C = D = 0$) [?]. Hence, $G_\zeta$ is a block diagonal matrix.

According to the Cramér-Rao bound [?], a parameter (or a pair of parameters) has a unique asymptotically tight lower bound of the variance (or covariance) of the unbiased estimate, which is given by the corresponding element of the inverse of the Fisher information matrix involving this parameter (or this pair of parameters). Recall that $I_\eta$ is the index set of the parameters shared by $[\eta]$ and $[\zeta]$, and that $J_\theta$ is the index set of the parameters shared by $[\theta]$ and $[\zeta]$; we have $(G_\zeta^{-1})_{I_\eta} = (G_\eta^{-1})_{I_\eta}$ and $(G_\zeta^{-1})_{J_\theta} = (G_\theta^{-1})_{J_\theta}$, i.e.,

$$
G_\zeta^{-1} = \begin{pmatrix} (G_\eta^{-1})_{I_\eta} & 0 \\ 0 & (G_\theta^{-1})_{J_\theta} \end{pmatrix}
$$

Since $G_\zeta$ is a block diagonal matrix, the proposition follows.

Appendix D

Proof of Proposition 3.1

Proof: Let $B_q$ be an $\epsilon$-ball surface centered at $q(x)$ on the manifold $S$, i.e., $B_q = \{q' \in S \mid KL(q, q') = \epsilon\}$, where $KL(\cdot,\cdot)$ denotes the Kullback-Leibler divergence and $\epsilon$ is small. $\zeta_q$ is the coordinates of $q(x)$. Let $q(x) + dq$ be a neighbor of $q(x)$ uniformly sampled on $B_q$ and $\zeta_{q(x)+dq}$ be its corresponding coordinates. For a small $\epsilon$, we can calculate the expected squared Fisher information distance between $q(x)$ and $q(x) + dq$ as follows:

$$
E_{B_q} = \int (\zeta_{q(x)+dq} - \zeta_q)^T G_\zeta (\zeta_{q(x)+dq} - \zeta_q)\, dB_q \tag{22}
$$

where $G_\zeta$ is the Fisher information matrix at $q(x)$.

Since the Fisher information matrix $G_\zeta$ is both positive definite and symmetric, there exists a singular value decomposition $G_\zeta = U^T \Lambda U$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix with diagonal entries equal to the eigenvalues of $G_\zeta$ (all $> 0$), i.e., $(\lambda_1, \lambda_2, \ldots, \lambda_n)$.

Applying the singular value decomposition to Equation (22), the expectation becomes:

$$
E_{B_q} = \int (\zeta_{q(x)+dq} - \zeta_q)^T U^T \Lambda U (\zeta_{q(x)+dq} - \zeta_q)\, dB_q \tag{23}
$$

Note that $U$ is an orthogonal matrix, and the transformation $U(\zeta_{q(x)+dq} - \zeta_q)$ is a norm-preserving rotation.

Now, we need to show that among all tailored $k$-dimensional submanifolds of $S$, $[\zeta]_{lt}$ is the one that preserves the maximum information distance. Assume $I_T = \{i_1, i_2, \ldots, i_k\}$ is the index of the $k$ coordinates that we choose to form the tailored submanifold $T$ in the mixed coordinates $[\zeta]$.

First, according to the fundamental analytical properties of the surface of the hyper-ellipsoid, we will show that there exists a strict positive monotonicity between the expected information distance $E_{B_q}$ for $T$ and the sum of eigenvalues of the sub-matrix $(G_\zeta)_{I_T}$.

Based on the definition of the hyper-ellipsoid, the $\epsilon$-ball surface $B_q$ is indeed the surface of a hyper-ellipsoid (centered at $q(x)$) determined by $G_\zeta$. Let the eigenvalues of $G_\zeta$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. Then, the surface integral on $B_q$ can be decomposed as follows:

$$
E_{B_q} = \int v^T \Lambda v\, dB_q = \int v^T diag(\lambda_1) v\, dB_q + \int v^T diag(\lambda_2) v\, dB_q + \cdots + \int v^T diag(\lambda_n) v\, dB_q
$$

where $v = U(\zeta_{q(x)+dq} - \zeta_q)$ and $diag(\lambda_i)$ is the diagonal matrix with the main diagonal $(0, \ldots, \lambda_i, \ldots, 0)$. We can
show the monotonicity between the integration values and
eigenvalues:

$$\int v^T \mathrm{diag}(\lambda_1) v \geq \int v^T \mathrm{diag}(\lambda_2) v \geq \cdots \geq \int v^T \mathrm{diag}(\lambda_n) v \qquad (24)$$

The surface of the ellipsoid $B_q$ may be parameterized in
several ways. In terms of Cartesian coordinate systems, the
equation of $B_q$ is:

$$\frac{v_1^2}{r_1^2} + \frac{v_2^2}{r_2^2} + \cdots + \frac{v_n^2}{r_n^2} = 1$$

where $r_i^2$ denotes the squares of the semi-axes and is determined
by the reciprocal of the corresponding eigenvalue $\lambda_i$.
Consider the two-dimensional ellipsoid $B_q$:

$$\frac{v_1^2}{r_1^2} + \frac{v_2^2}{r_2^2} = 1 \qquad (\lambda_1 \geq \lambda_2)$$

Then we can transform the Cartesian coordinates to spherical
coordinates as follows:

$$v_1 = r_1 \cos\theta, \qquad v_2 = r_2 \sin\theta$$

where $0 \leq \theta < 2\pi$. To prove that
$\int_{B_q} v^T \mathrm{diag}(\lambda_1) v \geq \int_{B_q} v^T \mathrm{diag}(\lambda_2) v$,
we need to show that $\int_{B_q} |\cos\theta| \geq \int_{B_q} |\sin\theta|$.
Since the ellipsoid is symmetric, we only need to prove that:

$$\int_{B_q} \cos\theta \geq \int_{B_q} \sin\theta, \qquad \theta \in [0, \tfrac{\pi}{2}]$$

By reformulating the above surface integral in terms of
definite integral, we have:

$$\int_0^{\pi/2} \cos\theta \sqrt{r_1^2 + (r_2^2 - r_1^2)\cos^2\theta}\, d\theta \geq \int_0^{\pi/2} \sin\theta \sqrt{r_1^2 + (r_2^2 - r_1^2)\cos^2\theta}\, d\theta \qquad (25)$$

Next we will prove the above inequality based on the
definition of Riemann integral. The integral interval $[0, \tfrac{\pi}{2}]$ of
$\theta$ can be partitioned into a finite sequence of subintervals, i.e.,
$[\theta_i, \theta_{i+1}]$, where

$$0 = \theta_0 < \cdots < \theta_i < \theta_{i+1} < \cdots < \theta_n = \tfrac{\pi}{2}$$

Let $\rho = \max_{0 \leq i < n} |\theta_{i+1} - \theta_i|$ be the maximum length of
subintervals. When $\rho$ approaches zero (hence $n \to \infty$), the
definite integral equals the Riemann integral, i.e.,
the limit of the Riemann sums. Therefore, the inequality (25) can
be transformed into the comparison between two limits:

$$\lim_{\rho \to 0} \boldsymbol{\cos\theta} \cdot \sqrt{r_1^2 + (r_2^2 - r_1^2)\boldsymbol{\cos^2\theta}} \geq \lim_{\rho \to 0} \boldsymbol{\sin\theta} \cdot \sqrt{r_1^2 + (r_2^2 - r_1^2)\boldsymbol{\cos^2\theta}} \qquad (26)$$

where the Riemann sums are denoted by vector multiplications,
$\boldsymbol{\cos\theta} = (\cos\theta_0, \ldots, \cos\theta_n)$ and
$\boldsymbol{\sin\theta} = (\sin\theta_0, \ldots, \sin\theta_n)$. Since $\cos 0 = \sin\tfrac{\pi}{2}$ and
$\cos\tfrac{\pi}{2} = \sin 0$, $\boldsymbol{\cos\theta}$ and $\boldsymbol{\sin\theta}$ can be seen as two vectors that share the
same components while arranged in different orders. We
can also see that there is a positive correlation between the
components in $\boldsymbol{\cos\theta}$ and $\sqrt{r_1^2 + (r_2^2 - r_1^2)\boldsymbol{\cos^2\theta}}$, while there
is a negative correlation between the components in $\boldsymbol{\sin\theta}$
and $\sqrt{r_1^2 + (r_2^2 - r_1^2)\boldsymbol{\cos^2\theta}}$. Then, the inequality (25)
is proved.

Therefore, the monotonicity (24) holds for the two-dimensional
ellipsoid. Similarly, the above Riemann integral analysis on the
spherical coordinates can be extended to the $n$-dimensional ellipsoid.
Hence, the monotonicity (24) holds for standard ellipsoids.
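As an informal numerical check of inequality (25), the sketch below compares midpoint Riemann sums of the two sides, taking the integrand weight to be the ellipse arc-length factor $\sqrt{r_1^2 + (r_2^2 - r_1^2)\cos^2\theta}$. The function names and the semi-axis values are illustrative assumptions, not part of the original proof.

```python
import math


def arc_weight(theta, r1, r2):
    """Arc-length factor of the ellipse v1 = r1*cos(theta), v2 = r2*sin(theta)."""
    return math.sqrt(r1 ** 2 + (r2 ** 2 - r1 ** 2) * math.cos(theta) ** 2)


def riemann_sides(r1, r2, n=10000):
    """Midpoint Riemann sums for both sides of inequality (25) over [0, pi/2]."""
    step = (math.pi / 2) / n
    cos_side = 0.0
    sin_side = 0.0
    for i in range(n):
        theta = (i + 0.5) * step
        weight = arc_weight(theta, r1, r2)
        cos_side += math.cos(theta) * weight * step
        sin_side += math.sin(theta) * weight * step
    return cos_side, sin_side


# lambda_1 >= lambda_2 corresponds to r1 <= r2 (a larger eigenvalue gives a shorter semi-axis).
cos_side, sin_side = riemann_sides(r1=1.0, r2=2.0)
print(cos_side >= sin_side)
```

When $r_1 = r_2$ the weight is constant and the two sums coincide, which matches the equality case of (24) for repeated eigenvalues.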


Thus, based on the monotonicity (24), the expected Fisher
information distance that can be preserved by a $k$-dimensional
standard ellipsoid is monotonic with the sum of eigenvalues
of the selected $k$ eigenvectors. To maximize the preserved
information distance, we should choose the top-$k$ eigenvalues,
i.e., $\lambda_1, \ldots, \lambda_k$.

Since $G_\zeta$ is a block diagonal matrix, the eigenvalues of $G_\zeta$
are the combined eigenvalues of its blocks. Based on Lemma
D.1, the elements on the main diagonal of the sub-matrix $A$
are lower bounded by one and those of $B$ are upper bounded by
one. Since the sum of eigenvalues equals the trace, the
top-$k$ eigenvalues are exactly the eigenvalues of sub-matrix $A$.
Thus, we have:

$$\begin{aligned}
E_{max} &= \int v^T \mathrm{diag}(\lambda_1) v \, dB_q + \cdots + \int v^T \mathrm{diag}(\lambda_k) v \, dB_q \\
&= \int (\zeta_{q(x)+dq} - \zeta_q)^T \begin{pmatrix} A & 0 \\ 0 & 0 \end{pmatrix} (\zeta_{q(x)+dq} - \zeta_q)\, dB_q \qquad (27)
\end{aligned}$$


Now, let us consider the best selection of coordinates $I_T$ in
$[\zeta]$. It is easy to see that the rotation operation $U$ in Equation
(23) does not affect the integral value of the expected Fisher
information distance. Therefore, based on Equation (27), taking $I_T$
to be the index set of the sub-matrix $A$ gives the maximum Fisher
information distance. This completes the proof.

Lemma D.1: For the Fisher information matrix $G_\zeta$, the diagonal
elements of $A$ are lower bounded by one, and those of $B$ are
upper bounded by one.

Proof: Assume the Fisher information matrix of $[\theta]$ to be
$G_\theta = \begin{pmatrix} U & X \\ X^T & V \end{pmatrix}$,
which is partitioned based on $I_\eta$ and $J_\theta$. Based on
Proposition 2.3, we have $A = U^{-1}$. Obviously,
the diagonal elements of $U$ are all smaller than one. According
to the succeeding Lemma D.2, we can see that the diagonal
elements of $A$ (i.e., $U^{-1}$) are greater than one.

Next, we need to show that the diagonal elements of
$B$ are smaller than 1. Using the Schur complement of
$G_\theta$, the bottom-right block of $G_\theta^{-1}$, i.e., $(G_\theta^{-1})_{J_\theta}$, equals
$(V - X^T U^{-1} X)^{-1}$. Thus, the diagonal elements of $B$ satisfy
$B_{jj} = (V - X^T U^{-1} X)_{jj} < V_{jj} < 1$. Hence, we complete
the proof.
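The bounds in Lemma D.1 can be illustrated on a toy case. The sketch below assumes the standard exponential-family identity that the Fisher information in $[\theta]$ equals the covariance matrix of the sufficient statistics, and uses a hypothetical distribution over two binary variables with statistics $x_1, x_2$ (playing the role of the first index set) and $x_1 x_2$ (the second). The distribution and the partition are illustrative choices only.

```python
# Hypothetical strictly positive distribution over (x1, x2) in {0,1}^2.
probs = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def stats(x):
    """Sufficient statistics of the log-linear model: x1, x2 and x1*x2."""
    return [x[0], x[1], x[0] * x[1]]


def fisher_theta(dist):
    """Covariance of the sufficient statistics (Fisher information in theta)."""
    mean = [sum(p * stats(x)[i] for x, p in dist.items()) for i in range(3)]
    return [[sum(p * stats(x)[i] * stats(x)[j] for x, p in dist.items())
             - mean[i] * mean[j] for j in range(3)] for i in range(3)]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


G = fisher_theta(probs)
U = [row[:2] for row in G[:2]]   # top-left block
X = [G[0][2], G[1][2]]           # off-diagonal block (a column here)
V = G[2][2]                      # bottom-right block (a scalar here)

A = inv2(U)                      # A = U^{-1}
# Schur complement V - X^T U^{-1} X, which plays the role of B.
B = V - sum(X[i] * A[i][j] * X[j] for i in range(2) for j in range(2))

print([A[0][0], A[1][1]], B, V)
```

In this example the diagonal of $U$ and the value $V$ are variances of binary quantities and hence below one, the diagonal of $A$ exceeds one, and the Schur complement does not exceed $V$.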


Lemma D.2: With an $l \times l$ positive definite matrix $H$, if
$H_{ii} < 1$, then $(H^{-1})_{ii} > 1$, $\forall i \in \{1, 2, \ldots, l\}$.

Proof: Since $H$ is positive definite, it is a Gramian matrix
of $l$ linearly independent vectors $v_1, v_2, \ldots, v_l$, i.e., $H_{ij} =
\langle v_i, v_j \rangle$ ($\langle \cdot, \cdot \rangle$ denotes the inner product). Similarly, $H^{-1}$
is the Gramian matrix of $l$ linearly independent vectors
$w_1, w_2, \ldots, w_l$ and $(H^{-1})_{ij} = \langle w_i, w_j \rangle$. It is easy to verify
that $\langle w_i, v_i \rangle = 1$, $\forall i \in \{1, 2, \ldots, l\}$. If $H_{ii} < 1$, we can see
that the norm $\|v_i\| = \sqrt{H_{ii}} < 1$. Since $\|w_i\| \times \|v_i\| \geq
\langle w_i, v_i \rangle = 1$, we have $\|w_i\| > 1$. Hence, $(H^{-1})_{ii} =
\langle w_i, w_i \rangle = \|w_i\|^2 > 1$.
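A quick numerical illustration of Lemma D.2 follows. The Cauchy-Schwarz step in the proof in fact gives the slightly stronger bound $(H^{-1})_{ii} \geq 1/H_{ii}$, which the sketch also checks. The matrix `H` is a hypothetical example (symmetric and strictly diagonally dominant, hence positive definite), and `invert` is a plain Gauss-Jordan helper written for this illustration.

```python
def invert(matrix):
    """Gauss-Jordan inverse without pivoting (adequate for positive definite input)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [value / pivot for value in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical positive definite matrix whose diagonal entries are all below one.
H = [[0.5, 0.2, 0.1],
     [0.2, 0.6, 0.3],
     [0.1, 0.3, 0.9]]

H_inv = invert(H)
diag_inv = [H_inv[i][i] for i in range(3)]
print(diag_inv)
```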


Appendix E 

Proof of Proposition 4.1


Proof: Let $M_{vbm}$ be the set of all probability distributions
realized by SBM. [13] proves that the mixed-coordinates of the
resulting projection $P$ on $M_{vbm}$ is $[\zeta]_P = (\eta^1_i, \eta^2_{ij}, 0)$,
given the 2-mixed-coordinates of $q(x)$. $M_{vbm}$ is equivalent
to the submanifold tailored by CIF, i.e., $[\zeta]_{2t}$. The corollary
follows from Proposition 3.1.


Appendix F 

Proof of Proposition 4.2

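Proof: Based on the $[\theta]$ coordinates of VBM given in the main text,
the coordinates $[\theta_{2+}]$ for VBM are zero: $\theta_{2+} = 0$. Next, we show
that the stationary distribution $p(x;\xi)$ learnt by ML has the same
$[\eta^1_i, \eta^2_{ij}]$ as $q(x)$.

For VBM, the derivatives of the energy function $E(x;\xi)$ with respect to
the parameters can be easily calculated:

$$\frac{\partial E(x;\xi)}{\partial U_{x_i x_j}} = -x_i x_j, \quad \text{for } U_{x_i x_j} \in \xi$$

$$\frac{\partial E(x;\xi)}{\partial b_{x_i}} = -x_i, \quad \text{for } b_{x_i} \in \xi$$

Thus, the ML gradients for $U_{x_i x_j}, b_{x_i} \in \xi$ are as follows:

$$\text{for } U_{x_i x_j}: \quad \langle x_i x_j \rangle_0 - \langle x_i x_j \rangle_\infty = \eta^2_{ij}(q(x)) - \eta^2_{ij}(p(x;\xi))$$

$$\text{for } b_{x_i}: \quad \langle x_i \rangle_0 - \langle x_i \rangle_\infty = \eta^1_i(q(x)) - \eta^1_i(p(x;\xi))$$

where $\langle \cdot \rangle_0$ denotes the average using the sample data and $\langle \cdot \rangle_\infty$
denotes the average with respect to the stationary distribution
$p(x;\xi)$.

Since VBM defines an e-flat submanifold of $S$, ML converges to the
unique solution that gives the best approximation
$p(x;\xi) \in M_{vbm}$ of $q(x)$. When ML converges, the gradients tend
to 0 and hence $\langle \cdot \rangle_\infty \to \langle \cdot \rangle_0$. Thus, we can
see that ML converges to a stationary distribution $p(x;\xi)$ that
preserves the coordinates $[\eta^1_i, \eta^2_{ij}]$ of $q(x)$. This completes the
proof.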
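The moment-matching behaviour described in this proof can be observed on a tiny fully visible model. The sketch below is an illustration under stated assumptions: three binary units, exact enumeration of all 8 states in place of sampling, a hypothetical strictly positive target `Q`, and plain gradient ascent with an arbitrarily chosen step size. None of these choices come from the paper.

```python
import math
from itertools import product

STATES = list(product([0, 1], repeat=3))
PAIRS = [(0, 1), (0, 2), (1, 2)]

# Hypothetical strictly positive target distribution q(x) over the 8 states.
Q = dict(zip(STATES, [0.20, 0.05, 0.10, 0.15, 0.05, 0.10, 0.05, 0.30]))


def moments(dist):
    """First- and second-order eta coordinates: E[x_i] and E[x_i x_j]."""
    first = [sum(p * x[i] for x, p in dist.items()) for i in range(3)]
    second = [sum(p * x[i] * x[j] for x, p in dist.items()) for i, j in PAIRS]
    return first + second


def vbm(bias, weights):
    """p(x) proportional to exp(sum_i b_i x_i + sum_{i<j} U_ij x_i x_j)."""
    scores = {}
    for x in STATES:
        neg_energy = sum(b * xi for b, xi in zip(bias, x))
        neg_energy += sum(w * x[i] * x[j] for w, (i, j) in zip(weights, PAIRS))
        scores[x] = math.exp(neg_energy)
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}


def fit_vbm(target, rate=0.5, steps=10000):
    """Gradient ascent on the log-likelihood; each gradient is a moment difference."""
    bias, weights = [0.0] * 3, [0.0] * 3
    eta_q = moments(target)
    for _ in range(steps):
        eta_p = moments(vbm(bias, weights))
        grad = [a - b for a, b in zip(eta_q, eta_p)]
        bias = [b + rate * g for b, g in zip(bias, grad[:3])]
        weights = [w + rate * g for w, g in zip(weights, grad[3:])]
    return bias, weights


bias, weights = fit_vbm(Q)
gap = max(abs(a - b) for a, b in zip(moments(Q), moments(vbm(bias, weights))))
print(gap)
```

The fitted model has no third-order interaction term by construction, so it generally differs from `Q` as a distribution even though its first- and second-order moments agree with those of `Q`.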


Appendix G 

Proof of Proposition 4.3

Proof: Since $p_i \in B$ and $p_{i+1} \in B$ is the projection of $q_{i+1}$,
then $D[q_{i+1}, p_i] \geq D[q_{i+1}, p_{i+1}]$. Similarly, $q_{i+1} \in H_q$ and
$q_{i+2} \in H_q$ is the projection of $p_{i+1}$, thus $D[q_{i+1}, p_{i+1}] \geq
D[q_{i+2}, p_{i+1}]$. This completes the proof.


Appendix H 

Proof of Proposition 4.4

Proof: Based on the definition of divergence, the following
relation holds:

$$\begin{aligned}
D[q(x,h), p(x,h)] &= D[q(x)q(h|x), p(x)p(h|x)] \\
&= E_{q(x,h)}\left[\log\frac{q(x)}{p(x)} + \log\frac{q(h|x)}{p(h|x)}\right] \\
&= D[q(x), p(x)] + E_{q(x)}\big[D[q(h|x), p(h|x)]\big]
\end{aligned}$$

where $E_{q(x,h)}[\cdot]$ and $E_{q(x)}[\cdot]$ are the expectations taken over
$q(x,h)$ and $q(x)$ respectively.

Therefore, the minimum divergence between $p(x,h;\xi_p)$ and
$H_q$ is given as:

$$\begin{aligned}
D(H_q, p(x,h;\xi_p)) &= \min_{\xi_q} D[q(x,h;\xi_q), p(x,h;\xi_p)] \\
&= \min_{\xi_q} \big\{ D[q(x), p(x)] + E_{q(x)}\big[D[q(h|x;\xi_q), p(h|x;\xi_p)]\big] \big\} \\
&= D[q(x), p(x)] + \min_{\xi_q} \big\{ E_{q(x)}\big[D[q(h|x;\xi_q), p(h|x;\xi_p)]\big] \big\} \\
&= D[q(x), p(x)]
\end{aligned}$$

In the last equality, the expected divergence between $q(h|x;\xi_q)$
and $p(h|x;\xi_p)$ vanishes if and only if $\xi_q = \xi_p$. This completes
the proof.
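The decomposition of the joint divergence into a marginal term and an expected conditional term can be checked numerically on small tables. The sketch below uses one binary visible variable and one binary hidden variable; both joint tables are hypothetical values chosen only for illustration, and `D` is taken to be the Kullback-Leibler divergence.

```python
import math


def kl(q, p):
    """Kullback-Leibler divergence D[q, p] between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def flatten(joint):
    return [value for row in joint for value in row]


def marginal_x(joint):
    return [sum(row) for row in joint]


def conditional_h(joint):
    return [[value / sum(row) for value in row] for row in joint]


# Hypothetical joint tables indexed as joint[x][h].
q_joint = [[0.10, 0.30], [0.20, 0.40]]
p_joint = [[0.25, 0.25], [0.15, 0.35]]

q_x, p_x = marginal_x(q_joint), marginal_x(p_joint)
q_h, p_h = conditional_h(q_joint), conditional_h(p_joint)

# Joint divergence versus marginal term plus expected conditional term.
joint_div = kl(flatten(q_joint), flatten(p_joint))
split_div = kl(q_x, p_x) + sum(q_x[x] * kl(q_h[x], p_h[x]) for x in range(2))

# Keeping q(x) but replacing q(h|x) by p(h|x) removes the conditional term.
q_best = [[q_x[x] * p_h[x][h] for h in range(2)] for x in range(2)]
best_div = kl(flatten(q_best), flatten(p_joint))

print(joint_div, split_div, best_div, kl(q_x, p_x))
```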


Appendix I 

Proof of Proposition 4.5

Proof: First, we prove the uniqueness of the projection $\Gamma_B(q)$.
From the $[\theta]$ of BM in Equation (15), $B$ is an e-flat smooth
submanifold of $S_{xh}$. Thus the projection is unique.

Second, in order to find the $p(x,h;\xi_p) \in B$ with parameter
$\xi_p$ that minimizes the divergence between $q(x,h;\xi_q) \in H_q$
and $B$, the gradient descent method iteratively adjusts $\xi_p$ in
the negative gradient direction, along which the divergence $D[q, p(\xi_p)]$
decreases fastest:

$$\Delta\xi_p = -\lambda \frac{\partial D[q, p(\xi_p)]}{\partial \xi_p}$$

where $D[q, p(\xi_p)]$ is treated as a function of BM's parameters
$\xi_p$ and $\lambda$ is the learning rate. As shown in [32], the gradient
descent method converges to the minimum of the divergence
with proper choices of $\lambda$, and hence achieves the projection
point $\Gamma_B(q)$.

Last, we show that the mixed coordinates $[\zeta]_{\Gamma_B(q)}$ in
Equation (20) are exactly the convergence point of the ML
learning for BM. For distributions on the manifold $S_{xh}$,
the states of all hidden units are also visible and hence
the BM with hidden units is equivalent to VBM by treating
hidden units as visible ones. Based on Proposition
4.2, ML converges to the projection point $\Gamma_B(q)$ with a
stationary distribution $p(x,h;\xi_p)$ that preserves the corresponding
first- and second-order $\eta$-coordinates of $q(x,h;\xi_q)$. This
completes the proof.